SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

Pham, Kien T.; He, Yingqing; Xing, Yazhou; Chen, Qifeng; Chen, Long

Computer Science > Graphics

arXiv:2508.00782 (cs)

[Submitted on 1 Aug 2025]

Title: SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

Authors: Kien T. Pham, Yingqing He, Yazhou Xing, Qifeng Chen, Long Chen

Abstract: Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on semantic information, such as the classes of sounding sources present in the audio, which limits their ability to generate videos with accurate content and spatial composition. In contrast, humans can not only identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This information can be recovered from specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audio to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: 1) Audio-guided Video Planning: We meticulously adapt a state-of-the-art MLLM for a novel task of harnessing spatial and semantic cues from input audio to construct Video Scene Layouts (VSLs). These serve as an intermediate representation that bridges the gap between the audio and video modalities. 2) Layout-grounded Video Generation: We develop an efficient and effective approach to seamlessly integrate VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
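The abstract names loudness as one physical property from which spatial indicators can be derived, but it does not say how such indicators are computed; in SpA2V that reasoning is performed by the adapted MLLM. The sketch below is therefore not the authors' method. It is a minimal illustration, assuming a stereo recording, of how a per-frame level difference between the two channels could hint at a source's horizontal position and movement direction. All function names, the frame length, and the 3 dB threshold are hypothetical choices made for this example.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """Root-mean-square loudness of each non-overlapping frame."""
    out = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        out.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return out


def level_difference_db(left: Sequence[float], right: Sequence[float],
                        frame_len: int = 1024, eps: float = 1e-12) -> List[float]:
    """Per-frame inter-channel level difference in dB (positive = louder on the right)."""
    l_rms = frame_rms(left, frame_len)
    r_rms = frame_rms(right, frame_len)
    return [20.0 * math.log10((r + eps) / (l + eps)) for l, r in zip(l_rms, r_rms)]


def coarse_motion(ild_db: Sequence[float], threshold_db: float = 3.0) -> str:
    """Guess a horizontal movement direction from the first and last frame.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if not ild_db:
        return "unknown"
    delta = ild_db[-1] - ild_db[0]
    if delta > threshold_db:
        return "left_to_right"
    if delta < -threshold_db:
        return "right_to_left"
    return "static"


# Synthetic example: a 440 Hz tone panned linearly from the left to the right channel.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
pan = [t / (sr - 1) for t in range(sr)]          # 0.0 = fully left, 1.0 = fully right
left = [s * (1.0 - p) for s, p in zip(tone, pan)]
right = [s * p for s, p in zip(tone, pan)]

ild = level_difference_db(left, right, frame_len=1000)
print(coarse_motion(ild))  # prints "left_to_right"
```

A real system would need far more than this: level differences say nothing about depth, and mono recordings carry no inter-channel cue at all, which is part of why the paper turns to a learned model for the planning stage.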

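Comments:	The 33rd ACM Multimedia Conference (MM '25)
Subjects:	Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2508.00782 [cs.GR]
	(or arXiv:2508.00782v1 [cs.GR] for this version)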
	https://doi.org/10.48550/arXiv.2508.00782

Submission history

From: Trung Kien Pham
[v1] Fri, 1 Aug 2025 17:05:04 UTC (5,984 KB)
