Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval

Lin, Junan; Liu, Daizong; Chen, Xianke; Qu, Xiaoye; Yang, Xun; Zhu, Jixiang; Zhang, Sanyuan; Dong, Jianfeng

arXiv:2508.04273 (cs)

[Submitted on 6 Aug 2025 (v1), last revised 25 Oct 2025 (this version, v3)]

Authors: Junan Lin, Daizong Liu, Xianke Chen, Xiaoye Qu, Xun Yang, Jixiang Zhu, Sanyuan Zhang, Jianfeng Dong

Abstract: Video Moment Retrieval (VMR) aims to retrieve a specific moment that is semantically related to a given query. Most existing VMR methods focus solely on the visual and textual modalities and neglect the complementary but important audio modality. Although a few recent works attempt joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are impractical because not all audio is helpful for video moment retrieval: the audio of some videos may be pure noise or background sound that carries no information for determining the moment. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio and assigns weights accordingly to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses the audio and visual modalities at the local, event, and global levels, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of a missing audio modality during inference. To evaluate our method, we construct a new VMR dataset, Charades-AudioMatter, in which audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model's capability in utilizing the audio modality. Extensive experiments validate the effectiveness of our method, which achieves state-of-the-art performance among VMR methods that use audio-video fusion. Our code is available at https://github.com/HuiGuanLab/IMG.
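The importance-weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): it assumes a single scalar importance score per video, obtained from a hypothetical linear layer followed by a sigmoid, and a plain additive fusion in place of the paper's multi-granularity module. The function names `audio_importance` and `fuse`, and the parameters `weights` and `bias`, are illustrative assumptions only.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, maps a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def audio_importance(audio_feats, weights, bias):
    """Hypothetical importance predictor (an assumption, not the paper's design).

    Mean-pools the clip-level audio features over time, then applies a
    linear layer and a sigmoid to produce one score in (0, 1).
    """
    dim = len(weights)
    pooled = [
        sum(clip[d] for clip in audio_feats) / len(audio_feats)
        for d in range(dim)
    ]
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return sigmoid(logit)


def fuse(visual_feats, audio_feats, importance):
    """Add audio features to visual features, scaled by the importance score.

    A score near 0 leaves the visual features almost unchanged, so noisy
    audio contributes little to the fused representation.
    """
    return [
        [v + importance * a for v, a in zip(v_clip, a_clip)]
        for v_clip, a_clip in zip(visual_feats, audio_feats)
    ]


# Toy example: two clips with two-dimensional features.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [0.5, 0.5]]

score = audio_importance(audio, weights=[0.0, 0.0], bias=0.0)
fused = fuse(visual, audio, score)
```

In this sketch a strongly negative logit (for example a large negative `bias`) drives the score toward 0, so the fused features fall back to the visual features alone, which mirrors the stated goal of suppressing uninformative audio.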

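Comments:	Accepted to ACM MM 2025
Subjects:	Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2508.04273 [cs.IR]
	(or arXiv:2508.04273v3 [cs.IR] for this version)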
	https://doi.org/10.48550/arXiv.2508.04273

Submission history

From: Junan Lin
[v1] Wed, 6 Aug 2025 09:58:43 UTC (4,237 KB)
[v2] Sat, 11 Oct 2025 12:07:19 UTC (4,254 KB)
[v3] Sat, 25 Oct 2025 03:36:49 UTC (4,254 KB)
