CenXiv.org
cs.MM


Multimedia

  • New submissions
  • Cross-lists
  • Replacements

See recent articles

Showing new listings for Wednesday, 8 October 2025

Total of 8 entries

New submissions (showing 2 of 2 entries)

[1] arXiv:2510.05839 [Chinese pdf, pdf, html, other]
Title: Towards Robust and Reliable Multimodal Fake News Detection with Incomplete Modality
Hengyang Zhou, Yiwei Wei, Jian Yang, Zhenyu Zhang
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)

Multimodal fake news detection (MFND) has become an urgent task with the surge of multimodal fake content on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from multimodal content. However, in real-world applications, multimedia news may naturally lose some information during dissemination, resulting in modality incompleteness, which is detrimental to the generalization and robustness of existing models. To this end, we propose a novel generic and robust multimodal fusion strategy, termed Multi-expert Modality-incomplete Learning Network (MMLNet), which is simple yet effective. It consists of three key steps: (1) Multi-Expert Collaborative Reasoning, which compensates for missing modalities by dynamically leveraging complementary information through multiple experts; (2) Incomplete Modality Adapters, which compensate for the missing information by leveraging the new feature distribution; and (3) Modality Missing Learning, which leverages a label-aware adaptive weighting strategy to learn a robust representation with contrastive learning. We evaluate MMLNet on three real-world benchmarks across two languages, demonstrating superior performance compared to state-of-the-art methods while maintaining relative simplicity. By ensuring the accuracy of fake news detection in incomplete modality scenarios caused by information propagation, MMLNet effectively curbs the spread of malicious misinformation. Code is publicly available at https://github.com/zhyhome/MMLNet.
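
A minimal PyTorch sketch of the multi-expert idea described above (dynamically weighting several experts over whichever modality embeddings are present) follows. It is only an illustration under assumed names and dimensions, not the authors' MMLNet implementation; the adapters and the label-aware contrastive objective are omitted.

import torch
import torch.nn as nn

class MultiExpertFusion(nn.Module):
    """Toy multi-expert fusion over possibly-missing modality embeddings."""
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, text_emb=None, image_emb=None):
        # Average only the modalities that are actually present.
        present = [e for e in (text_emb, image_emb) if e is not None]
        pooled = torch.stack(present, dim=0).mean(dim=0)                    # (B, dim)
        weights = torch.softmax(self.gate(pooled), dim=-1)                  # (B, E)
        outputs = torch.stack([ex(pooled) for ex in self.experts], dim=1)   # (B, E, dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)                 # (B, dim)

fusion = MultiExpertFusion()
text_only = fusion(text_emb=torch.randn(2, 256))   # image modality missing
print(text_only.shape)                             # torch.Size([2, 256])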

[2] arXiv:2510.06060 [Chinese pdf, pdf, html, other]
Title: Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information
Christian Marinoni, Riccardo Fosco Gramaccioni, Eleonora Grassucci, Danilo Comminiello
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

The generation of sounding videos has seen significant advancements with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive 360-degree environments. This limitation restricts the creation of audio-visual experiences that are aware of off-camera events. To the best of our knowledge, this is the first work to introduce a framework for controllable audio-visual generation, addressing this unexplored gap. Specifically, we propose a diffusion model by introducing a set of powerful conditioning signals derived from the full 360-degree space: a panoramic saliency map to identify regions of interest, a bounding-box-aware signed distance map to define the target viewpoint, and a descriptive caption of the entire scene. By integrating these controls, our model generates spatially-aware viewpoint video and audio that are coherently influenced by the broader, unseen environmental context, introducing strong controllability that is essential for realistic and immersive audio-visual generation. We present audio-visual examples demonstrating the effectiveness of our framework.
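
One of the conditioning signals listed above is a bounding-box-aware signed distance map for the target viewpoint. The sketch below computes a plain signed distance map for an axis-aligned box on a panorama-sized grid (negative inside, positive outside); it illustrates what such a signal could look like and is not necessarily the authors' exact formulation.

import numpy as np

def box_signed_distance_map(height, width, box):
    """Signed pixel distance to an axis-aligned box (x0, y0, x1, y1): negative inside, positive outside."""
    x0, y0, x1, y1 = box
    ys, xs = np.mgrid[0:height, 0:width]
    # Outside the box: Euclidean distance to the box.
    dx = np.maximum(np.maximum(x0 - xs, xs - x1), 0)
    dy = np.maximum(np.maximum(y0 - ys, ys - y1), 0)
    outside = np.hypot(dx, dy)
    # Inside the box: negative distance to the nearest edge.
    inside = np.minimum(np.minimum(xs - x0, x1 - xs), np.minimum(ys - y0, y1 - ys))
    return np.where(outside > 0, outside, -inside).astype(np.float32)

sdf = box_signed_distance_map(512, 1024, box=(300, 100, 600, 400))
print(sdf.shape, float(sdf.min()), float(sdf.max()))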

Cross submissions (showing 5 of 5 entries)

[3] arXiv:2510.05295 (cross-listed from cs.SD) [Chinese pdf, pdf, html, other]
Title: AUREXA-SE: Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement
M. Sajid, Deepanshu Gupta, Yash Modi, Sanskriti Jain, Harshith Jai Surya Ganji, A. Rahaman, Harshvardhan Choudhary, Nasir Saleem, Amir Hussain, M. Tanveer
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

In this paper, we propose AUREXA-SE (Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement), a progressive bimodal framework tailored for audio-visual speech enhancement (AVSE). AUREXA-SE jointly leverages raw audio waveforms and visual cues by employing a U-Net-based 1D convolutional encoder for audio and a Swin Transformer V2 for efficient and expressive visual feature extraction. Central to the architecture is a novel bidirectional cross-attention mechanism, which facilitates deep contextual fusion between modalities, enabling rich and complementary representation learning. To capture temporal dependencies within the fused embeddings, a stack of lightweight Squeezeformer blocks combining convolutional and attention modules is introduced. The enhanced embeddings are then decoded via a U-Net-style decoder for direct waveform reconstruction, ensuring perceptually consistent and intelligible speech output. Experimental evaluations demonstrate the effectiveness of AUREXA-SE, achieving significant performance improvements over noisy baselines, with STOI of 0.516, PESQ of 1.323, and SI-SDR of -4.322 dB. The source code of AUREXA-SE is available at https://github.com/mtanveer1/AVSEC-4-Challenge-2025.
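
The bidirectional cross-attention described above can be sketched with standard attention layers: the audio stream attends to the visual stream and vice versa, and the two directions are fused. This is a minimal, hypothetical illustration built on torch.nn.MultiheadAttention, not the AUREXA-SE implementation.

import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Toy sketch: audio queries attend to video, video queries attend to audio."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio, video):
        # audio: (B, Ta, dim), video: (B, Tv, dim)
        audio_ctx, _ = self.a2v(query=audio, key=video, value=video)
        video_ctx, _ = self.v2a(query=video, key=audio, value=audio)
        # Mean-pool the video-conditioned stream onto the audio time axis
        # (a simplification) and fuse the two directions.
        pooled_video = video_ctx.mean(dim=1, keepdim=True).expand_as(audio_ctx)
        return self.fuse(torch.cat([audio_ctx, pooled_video], dim=-1))

xattn = BidirectionalCrossAttention()
out = xattn(torch.randn(2, 100, 256), torch.randn(2, 25, 256))
print(out.shape)   # torch.Size([2, 100, 256])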

[4] arXiv:2510.05661 (cross-listed from cs.CV) [Chinese pdf, pdf, html, other]
Title: When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach
Daniel Gonzálbez-Biosca, Josep Cabacas-Maso, Carles Ventura, Ismael Benito-Altamirano
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Automated video editing remains an underexplored task in the computer vision and multimedia domains, especially when contrasted with the growing interest in video generation and scene understanding. In this work, we address the specific challenge of editing multicamera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut. Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut), which integrates log-mel spectrograms from the audio signals, an optional image embedding, and scalar temporal features through a lightweight convolutional-transformer pipeline. For the spatial selection task (how to cut), we improve on prior work by replacing older backbones such as ResNet with a CLIP-based encoder and by constraining distractor selection to segments from the same concert. Our dataset was constructed following a pseudo-labeling approach, in which raw video data was automatically clustered into coherent shot segments. We show that our models outperform previous baselines in detecting cut points and provide competitive visual shot selection, advancing the state of the art in multimodal automated video editing.
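
The temporal-segmentation branch described above takes log-mel spectrograms of the audio signal as input. A minimal torchaudio sketch of extracting such features is shown below; the STFT and mel parameters are illustrative guesses, not the paper's configuration.

import torch
import torchaudio

sample_rate = 16000   # assumed; the paper's setting may differ
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, sample_rate * 5)   # stand-in for 5 s of concert audio
log_mel = to_db(mel(waveform))               # (channels, n_mels, frames)
print(log_mel.shape)
# Frame-level features like these would feed the lightweight
# convolutional-transformer cut-point model described in the abstract.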

[5] arXiv:2510.05828 (cross-listed from cs.SD) [Chinese pdf, pdf, html, other]
Title: StereoSync: Spatially-Aware Stereo Audio Generation from Video
Christian Marinoni, Riccardo Fosco Gramaccioni, Kazuki Shimada, Takashi Shibuya, Yuki Mitsufuji, Danilo Comminiello
Comments: Accepted at IJCNN 2025
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Although audio generation has been widely studied over recent years, video-aligned audio generation still remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. Moreover, StereoSync achieves efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high-quality synthesis. Unlike existing methods that primarily focus on temporal synchronization, StereoSync introduces a significant advancement by incorporating spatial awareness into video-aligned audio generation. Indeed, given an input video, our approach extracts spatial cues from depth maps and bounding boxes, using them as cross-attention conditioning in a diffusion-based audio generation model. Such an approach allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of a video scene. We evaluate StereoSync on Walking The Maps, a curated dataset comprising videos from video games that feature animated characters walking through diverse environments. Experimental results demonstrate the ability of StereoSync to achieve both temporal and spatial alignment, advancing the state of the art in video-to-audio generation and resulting in a significantly more immersive and realistic audio experience.
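
StereoSync is described as turning depth maps and bounding boxes into cross-attention conditioning for a diffusion model. One plausible, entirely hypothetical way to build such conditioning tokens is to pool the depth map onto a coarse grid and append one token per bounding box, as sketched below; this is not the paper's actual encoder.

import torch
import torch.nn.functional as F

def spatial_conditioning_tokens(depth, boxes, token_dim=64, grid=8):
    """Toy conditioning tokens from a depth map (H, W) and boxes [(x0, y0, x1, y1), ...]."""
    H, W = depth.shape
    # Coarse depth tokens: average-pool the depth map onto a grid x grid layout.
    pooled = F.adaptive_avg_pool2d(depth[None, None], (grid, grid))             # (1, 1, g, g)
    depth_tokens = pooled.flatten(2).transpose(1, 2).expand(-1, -1, token_dim)  # (1, g*g, D)
    # One token per box: normalized corner coordinates tiled to token_dim.
    box_feats = torch.tensor(
        [[x0 / W, y0 / H, x1 / W, y1 / H] for (x0, y0, x1, y1) in boxes],
        dtype=torch.float32,
    )
    box_tokens = box_feats.repeat(1, token_dim // 4)[None]                      # (1, B, D)
    return torch.cat([depth_tokens, box_tokens], dim=1)                         # (1, g*g + B, D)

tokens = spatial_conditioning_tokens(torch.rand(480, 640), [(100, 120, 300, 400)])
print(tokens.shape)   # torch.Size([1, 65, 64])
# Tokens like these would serve as the key/value context in the diffusion
# model's cross-attention layers.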

[6] arXiv:2510.05829 (cross-listed from cs.SD) [Chinese pdf, pdf, html, other]
Title: FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders
Riccardo Fosco Gramaccioni, Christian Marinoni, Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello
Comments: Accepted at IJCNN 2025
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
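
The Gramian Representation Alignment Measure (GRAM) referenced above is, as we understand it, based on the volume of the parallelotope spanned by the modality embeddings, computed from their Gram matrix: the smaller the volume, the better the alignment. A hedged sketch of that computation, assuming unit-normalized vectors and the Gram-determinant formulation:

import torch

def gram_volume(*embeddings):
    """Volume of the parallelotope spanned by one vector per modality.

    Assumed formulation: L2-normalize each vector, stack them as the columns
    of A, and return sqrt(det(A^T A)). Identical vectors give ~0; mutually
    orthogonal vectors give 1.
    """
    A = torch.stack([e / e.norm() for e in embeddings], dim=1)   # (dim, k)
    gram = A.T @ A                                               # (k, k)
    return torch.sqrt(torch.clamp(torch.det(gram), min=0.0))

video, text, audio = torch.randn(3, 512)
print(float(gram_volume(video, text, audio)))    # near 1: random vectors are nearly orthogonal
print(float(gram_volume(video, video, video)))   # ~0: perfectly aligned (identical) vectors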

[7] arXiv:2510.05881 (cross-listed from cs.SD) [Chinese pdf, pdf, html, other]
Title: Segment-Factorized Full-Song Generation on Symbolic Piano Music
Ping-Yi Chen, Chih-Pin Tan, Yi-Hsuan Yang
Comments: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on AI for Music
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

We propose the Segmented Full-Song Model (SFS) for symbolic full-song generation. The model accepts a user-provided song structure and an optional short seed segment that anchors the main idea around which the song is developed. By factorizing a song into segments and generating each one through selective attention to related segments, the model achieves higher quality and efficiency compared to prior work. To demonstrate its suitability for human-AI interaction, we further wrap SFS into a web application that enables users to iteratively co-create music on a piano roll with customizable structures and flexible ordering.
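
The selective attention to related segments mentioned above can be pictured as a block attention mask: tokens attend within their own segment and to segments that share a section label, and everything else is blocked. The sketch below is a hypothetical illustration of that masking idea, with made-up labels and sizes, not the SFS model itself.

import torch

def segment_attention_mask(segment_labels, tokens_per_segment):
    """Boolean mask (True = may attend) over a flattened token sequence.

    segment_labels: one label per song segment, e.g. ["A", "B", "A", "C"].
    Tokens attend within their own segment and to same-labeled segments.
    """
    total = len(segment_labels) * tokens_per_segment
    mask = torch.zeros(total, total, dtype=torch.bool)
    for i, li in enumerate(segment_labels):
        for j, lj in enumerate(segment_labels):
            if i == j or li == lj:
                rows = slice(i * tokens_per_segment, (i + 1) * tokens_per_segment)
                cols = slice(j * tokens_per_segment, (j + 1) * tokens_per_segment)
                mask[rows, cols] = True
    return mask

mask = segment_attention_mask(["A", "B", "A", "C"], tokens_per_segment=4)
print(mask.shape, int(mask.sum()))   # torch.Size([16, 16]) 96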

Replacement submissions (showing 1 of 1 entries)

[8] arXiv:2508.15690 (replaced) [Chinese pdf, pdf, html, other]
Title: GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning
Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran
Comments: 25 pages, 10 tables, 3 figures
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

GRAFT is a structured multimodal benchmark for evaluating models on instruction-following, visual reasoning, and visual-textual alignment tasks. It features programmatically generated charts and synthetically rendered tables, created with Python visualization libraries to ensure control over data semantics, structure, and clarity. Each GRAFT instance pairs a chart or table image with a systematically generated, multi-step analytical question based solely on visual content. Answers are provided in structured formats such as JSON or YAML, supporting consistent evaluation of both reasoning and output format. The benchmark introduces a taxonomy of reasoning types including comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection to enable comprehensive assessment. Reference answers follow strict factual and formatting guidelines for precise, aspect-based evaluation. GRAFT offers a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded, structured reasoning tasks, setting a new evaluation standard in this field.
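
Because GRAFT's reference answers are provided in structured formats such as JSON or YAML, a small illustrative record is shown below. The field names and values are invented for illustration only; they are not taken from the benchmark.

import json

# Hypothetical GRAFT-style reference answer for a chart question;
# the benchmark's actual schema may differ.
reference_answer = {
    "question_id": "example-001",
    "reasoning_type": "trend identification",
    "answer": {
        "trend": "increasing",
        "period": "2019-2023",
        "supporting_values": [12.4, 15.1, 18.9, 22.0, 25.3],
    },
}
print(json.dumps(reference_answer, indent=2))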

Total of 8 entries