End-to-End Joint Target and Non-Target Speakers ASR

Masumura, Ryo; Makishima, Naoki; Yamane, Taiga; Yamazaki, Yoshihiko; Mizuno, Saki; Ihori, Mana; Uchida, Mihiro; Suzuki, Keita; Sato, Hiroshi; Tanaka, Tomohiro; Takashima, Akihiko; Suzuki, Satoshi; Moriya, Takafumi; Hojo, Nobukatsu; Ando, Atsushi

arXiv:2306.02273 (cs)

[Submitted on 4 June 2023]

Authors: Ryo Masumura, Naoki Makishima, Taiga Yamane, Yoshihiko Yamazaki, Saki Mizuno, Mana Ihori, Mihiro Uchida, Keita Suzuki, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando

Abstract: This paper proposes a novel automatic speech recognition (ASR) system that transcribes each individual speaker's speech from multi-talker overlapped speech while identifying whether that speaker is the target speaker or a non-target speaker. Target-speaker ASR systems are a promising way to transcribe only a target speaker's speech by enrolling the target speaker's information. However, conversational ASR applications often require transcribing both the target speaker's speech and that of non-target speakers in order to understand interactive information. To naturally consider both target and non-target speakers in a single ASR model, our idea is to extend autoregressive modeling-based multi-talker ASR systems to utilize the enrollment speech of the target speaker. Our proposed ASR is performed by recursively generating both textual tokens and tokens that represent target or non-target speakers. Our experiments demonstrate the effectiveness of our proposed method.
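
As a rough illustration of the output format the abstract describes, the sketch below splits a serialized stream of textual tokens and speaker-role tokens into per-speaker segments. The tag strings `<target>` and `<non-target>`, the assumption that each tag precedes the words it labels, and the function name `split_by_speaker` are all hypothetical; the abstract does not specify the actual token inventory or ordering, so this is only one plausible reading, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical speaker-role tokens; the abstract does not name the real symbols.
TARGET = "<target>"
NON_TARGET = "<non-target>"


def split_by_speaker(tokens: List[str]) -> List[Tuple[str, str]]:
    """Group a serialized token stream into (speaker_role, text) segments.

    Assumes (for illustration only) that every utterance is introduced by a
    role token and followed by its textual tokens.
    """
    segments: List[Tuple[str, str]] = []
    role = None
    words: List[str] = []
    for tok in tokens:
        if tok in (TARGET, NON_TARGET):
            # A new role token closes the segment collected so far.
            if role is not None:
                segments.append((role, " ".join(words)))
            role = "target" if tok == TARGET else "non-target"
            words = []
        elif role is not None:
            words.append(tok)
        # Textual tokens seen before any role token are ignored.
    if role is not None:
        segments.append((role, " ".join(words)))
    return segments


# Made-up hypothesis sequence, not data from the paper.
hyp = ["<target>", "hello", "there", "<non-target>", "hi", "how", "are", "you"]
segments = split_by_speaker(hyp)
for speaker_role, text in segments:
    print(f"{speaker_role}: {text}")
```

Keeping role tokens in the same output sequence as the text is what would let a single autoregressive decoder emit both the transcript and the target/non-target decision, which is the design the abstract points to.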

Comments:	Accepted at Interspeech 2023
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2306.02273 [cs.CL]
	(or arXiv:2306.02273v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.02273

Submission history

From: Ryo Masumura
[v1] Sun, 4 Jun 2023 06:38:15 UTC (79 KB)
