Rethinking the joint estimation of magnitude and phase for time-frequency domain neural vocoders

Dai, Lingling; Li, Andong; Lei, Tong; Yu, Meng; Li, Xiaodong; Zheng, Chengshi

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2509.18806 (eess)

[Submitted on 23 Sep 2025]

Title: Rethinking the joint estimation of magnitude and phase for time-frequency domain neural vocoders

Authors: Lingling Dai, Andong Li, Tong Lei, Meng Yu, Xiaodong Li, Chengshi Zheng

Abstract: Time-frequency (T-F) domain neural vocoders have shown promising results in synthesizing high-fidelity audio. Nevertheless, the mechanism for effectively predicting magnitude and phase targets jointly remains unclear. In this paper, we start from two representative T-F domain vocoders, Vocos and APNet2, which follow the single-stream and dual-stream modes for magnitude and phase estimation, respectively. When evaluating their performance on a large-scale dataset, we unexpectedly observe a severe performance collapse of APNet2. To stabilize its performance, we introduce three simple yet effective strategies, targeting the topological space, the source space, and the output space, respectively. Specifically, we modify the architectural topology for better information exchange in the topological space, introduce prior knowledge to facilitate the generation process in the source space, and optimize the backpropagation process for parameter updates with an improved output format in the output space. Experimental results demonstrate that the proposed method effectively facilitates the joint estimation of magnitude and phase in APNet2, thus bridging the performance gap between single-stream and dual-stream vocoders.
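For readers unfamiliar with the setting, the following minimal sketch illustrates why a T-F domain vocoder needs both magnitude and phase: the two are combined into a complex spectrum, which is then inverted to a waveform frame. This is not code from the paper. The function names, the eight-sample frame, and the naive DFT pair (standing in for a full STFT/inverse STFT with windowing and overlap-add) are assumptions chosen purely for illustration.

```python
import cmath
import math


def forward_dft(frame):
    """Naive DFT of a real-valued frame (illustrative stand-in for one STFT frame)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def combine(magnitude, phase):
    """Build complex spectrum bins from per-bin magnitude and phase (radians)."""
    return [m * cmath.exp(1j * p) for m, p in zip(magnitude, phase)]


def inverse_dft(spectrum):
    """Naive inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


# Hypothetical waveform frame; a vocoder would predict magnitude and phase
# from acoustic features instead of computing them from the target signal.
frame = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.75, -0.25]
spectrum = forward_dft(frame)
magnitude = [abs(c) for c in spectrum]
phase = [cmath.phase(c) for c in spectrum]

# With the correct magnitude and phase, the frame is recovered.
reconstructed = inverse_dft(combine(magnitude, phase))

# With the correct magnitude but all phases set to zero, it is not.
zero_phase = inverse_dft(combine(magnitude, [0.0] * len(magnitude)))
```

In this toy setting, `reconstructed` matches `frame` up to floating-point error, while `zero_phase` is a different signal even though its magnitude spectrum is identical, which is why phase has to be estimated alongside magnitude.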

Comments:	Submitted to ICASSP 2026
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2509.18806 [eess.AS]
	(or arXiv:2509.18806v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2509.18806

Submission history

From: Lingling Dai
[v1] Tue, 23 Sep 2025 08:57:13 UTC (1,898 KB)
