EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Xie, Tianxin; Yang, Shan; Li, Chenxing; Yu, Dong; Liu, Li


arXiv:2508.03543 (cs)

[Submitted on 5 Aug 2025 (v1), last revised 25 Oct 2025 (this version, v3)]



Abstract: Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS. Demo samples are available at https://emosteer-tts-demo.pages.dev/.
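The abstract does not say how the steering vectors are computed or applied. As a rough illustration of the general idea of activation steering, the sketch below assumes a difference-of-means steering vector, which is a common choice in the activation-steering literature and not necessarily the paper's procedure. All function names and the toy two-dimensional activations are hypothetical, and the emotional token searching step is not modeled.

```python
# Illustrative sketch only: difference-of-means activation steering on toy vectors.
# Every name below is hypothetical; none is taken from the paper or from any TTS library.

def mean_vector(rows):
    """Average a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(emotional, neutral):
    """Unit-norm difference between mean emotional and mean neutral activations."""
    diff = [e - n for e, n in zip(mean_vector(emotional), mean_vector(neutral))]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("emotional and neutral means are identical")
    return [d / norm for d in diff]

def steer(activation, direction, alpha):
    """Shift the activation by alpha along the direction (alpha sets the strength)."""
    return [a + alpha * d for a, d in zip(activation, direction)]

def erase(activation, direction):
    """Remove the component of the activation that lies along the unit direction."""
    proj = sum(a * d for a, d in zip(activation, direction))
    return [a - proj * d for a, d in zip(activation, direction)]

# Toy activations: the first coordinate carries the "emotion" signal.
neutral = [[0.0, 1.0], [0.0, 3.0]]
happy = [[2.0, 1.0], [4.0, 3.0]]

direction = steering_vector(happy, neutral)       # [1.0, 0.0]
steered = steer([1.0, 1.0], direction, alpha=0.5)  # [1.5, 1.0]
erased = erase([1.0, 1.0], direction)              # [0.0, 1.0]
```

In a real model the same arithmetic would be applied to hidden states at inference time, with the scalar `alpha` providing the continuous control the abstract refers to; which layers and tokens to modify is exactly what the sketch leaves out.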

Comments:	25 pages, 9 figures, 3 tables
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2508.03543 [cs.SD]
	(or arXiv:2508.03543v3 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2508.03543

Submission history

From: Tianxin Xie
[v1] Tue, 5 Aug 2025 15:12:49 UTC (1,725 KB)
[v2] Wed, 6 Aug 2025 06:54:21 UTC (1,725 KB)
[v3] Sat, 25 Oct 2025 12:11:44 UTC (1,833 KB)
