Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

Jiang, Ziyue; Ren, Yi; Ye, Zhenhui; Liu, Jinglin; Zhang, Chen; Yang, Qian; Ji, Shengpeng; Huang, Rongjie; Wang, Chunfeng; Yin, Xiang; Ma, Zejun; Zhao, Zhou


arXiv:2306.03509 (eess)

[Submitted on 6 Jun 2023]


Authors: Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao


Abstract: Scaling text-to-speech to large-scale, wild datasets has proven highly effective for generalizing timbre and speech style, particularly in zero-shot TTS. However, previous works usually encode speech into latents with an audio codec and generate them with autoregressive language models or diffusion models, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and that each of them should be modeled by a module with appropriate inductive biases. From this perspective, we carefully design a novel and large zero-shot TTS system called Mega-TTS, which is trained on large-scale wild data and models different attributes in different ways: 1) Instead of using latents encoded by an audio codec as the intermediate feature, we still choose the spectrogram, as it separates phase from the other attributes very well. Phase can be appropriately constructed by a GAN-based vocoder and does not need to be modeled by the language model. 2) We model timbre using global vectors, since timbre is a global attribute that changes slowly over time. 3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time within a sentence and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity due to the proper inductive bias of each module. Audio samples are available at https://mega-tts.github.io/demo-page.
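
The contrast the abstract draws between a slowly varying global attribute (timbre) and a quickly varying frame-level attribute (prosody) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `global_timbre_vector` and `quantize_prosody`, the mean-pooling step, the nearest-neighbour codebook lookup, and all numeric values are assumptions chosen for illustration, whereas the system described above uses learned encoders and a VQGAN-based acoustic model.

```python
from typing import List, Sequence

Vector = Sequence[float]


def global_timbre_vector(frames: Sequence[Vector]) -> List[float]:
    """Average frame-level features over time into one utterance-level vector.

    Assumption: mean pooling stands in for a learned timbre encoder.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def quantize_prosody(latents: Sequence[Vector],
                     codebook: Sequence[Vector]) -> List[int]:
    """Map each frame-level latent to the index of its nearest codebook entry.

    Assumption: a plain nearest-neighbour lookup stands in for the vector
    quantization step; a language model would then be fit on these indices.
    """
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
            for z in latents]


# Hypothetical toy inputs: three 2-dimensional reference frames.
reference_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
timbre = global_timbre_vector(reference_frames)  # one vector per utterance

# Hypothetical 3-entry codebook and four frame-level prosody latents.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
prosody_latents = [[0.1, -0.2], [0.9, 1.2], [3.5, 4.4], [1.1, 0.8]]
codes = quantize_prosody(prosody_latents, codebook)  # one index per frame
```

The point of the sketch is the shape of the outputs: timbre collapses to a single vector regardless of utterance length, while prosody stays a sequence of discrete codes, which is the kind of sequence a language model can fit.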

Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Cite as:	arXiv:2306.03509 [eess.AS]
	(or arXiv:2306.03509v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2306.03509

Submission history

From: Ziyue Jiang
[v1] Tue, 6 Jun 2023 08:54:49 UTC (1,500 KB)
