Towards Robust FastSpeech 2 by Modelling Residual Multimodality

Kögel, Fabian; Nguyen, Bac; Cardinaux, Fabien

doi:10.21437/Interspeech.2023-879

arXiv:2306.01442 (cs)

[Submitted on 2 June 2023]

Title: Towards Robust FastSpeech 2 by Modelling Residual Multimodality

Authors: Fabian Kögel, Bac Nguyen, Fabien Cardinaux

Abstract: State-of-the-art non-autoregressive text-to-speech (TTS) models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech. For expressive speech datasets, however, we observe characteristic audio distortions. We demonstrate that such artefacts are introduced to the vocoder reconstruction by over-smooth mel-spectrogram predictions, which are induced by the choice of mean-squared-error (MSE) loss for training the mel-spectrogram decoder. With MSE loss, FastSpeech 2 is limited to learning conditional averages of the training distribution, which might not lie close to a natural sample if the distribution still appears multimodal after all conditioning signals. To alleviate this problem, we introduce TVC-GMM, a mixture model of Trivariate-Chain Gaussian distributions, to model the residual multimodality. TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality, in particular for expressive datasets, as shown by both objective and subjective evaluation.
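The claim that an MSE-trained predictor learns a conditional average, which can lie far from any natural sample when the target is multimodal, can be illustrated with a toy sketch. This is not the authors' code and does not implement TVC-GMM; the bimodal target, its modes at -1.0 and +1.0, and the names `samples`, `mse`, `mse_optimal`, and `gap` are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical target: a single spectrogram bin that, for identical
# conditioning, is drawn near -1.0 or near +1.0 with equal probability.
samples = [random.gauss(random.choice([-1.0, 1.0]), 0.1) for _ in range(10_000)]


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# The constant prediction that minimises MSE is the sample mean.
mse_optimal = statistics.fmean(samples)

# Distance from that prediction to the nearest observed sample.
gap = min(abs(s - mse_optimal) for s in samples)

print(f"MSE-optimal prediction: {mse_optimal:.3f}")
print(f"MSE at the mean:        {mse(mse_optimal, samples):.3f}")
print(f"MSE at one mode (+1.0): {mse(1.0, samples):.3f}")
print(f"Gap to nearest sample:  {gap:.3f}")
```

In this sketch the mean scores a lower MSE than either mode, yet it falls in the empty region between the modes. That is the over-smoothing effect the abstract attributes to the MSE loss; a mixture-based output distribution can instead place probability mass on each mode.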

Comments:	Accepted at INTERSPEECH 2023
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2306.01442 [cs.SD]
	(or arXiv:2306.01442v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2306.01442
Related DOI:	https://doi.org/10.21437/Interspeech.2023-879

Submission history

From: Fabian Kögel
[v1] Fri, 2 Jun 2023 11:03:26 UTC (477 KB)
