Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Zhang, Hang; Li, Xin; Bing, Lidong

arXiv:2306.02858v4 (cs)

[Submitted on 5 Jun 2023 (v1), last revised 25 Oct 2023 (this version, v4)]

Authors: Hang Zhang, Xin Li, Lidong Bing

Abstract: We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous works that complement LLMs to process either visual or audio signals only, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. To counter the first challenge, we propose a Video Q-former to assemble a pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind, a universal embedding model aligning multiple modalities, as the pre-trained audio encoder and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the output of both visual and audio encoders with the LLM's embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then tune our model with visual-instruction datasets of moderate size but higher quality. We found that Video-LLaMA shows the ability to perceive and comprehend video content and to generate meaningful responses grounded in the visual and auditory information presented in the videos.
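
As a rough illustration of the Q-former idea mentioned in the abstract, the sketch below shows how a small set of query vectors might cross-attend over per-frame features from a frozen image encoder and then be projected into an LLM's embedding space. This is not the authors' implementation. The function names (`cross_attend`, `video_tokens_for_llm`), all dimensions, the additive per-frame position embeddings, and the omission of learned key/value projections and multiple layers are assumptions made purely for illustration; random numbers stand in for real encoder outputs and trained weights.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, tokens):
    # Single-head scaled dot-product cross-attention.
    # Assumption: keys and values are the tokens themselves (no learned projections).
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        attended.append([sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(dim)])
    return attended

def video_tokens_for_llm(frame_feats, frame_pos, queries, proj):
    # frame_feats: T frames, each a list of D-dim features (stand-in for a frozen image encoder).
    # frame_pos:   T position vectors, added to every feature of frame t (hypothetical temporal signal).
    # queries:     Q query vectors of size D (learnable in a real model, random here).
    # proj:        D_LLM x D matrix mapping attended queries into the LLM embedding space.
    tokens = [
        [x + p for x, p in zip(feat, frame_pos[t])]
        for t, frame in enumerate(frame_feats)
        for feat in frame
    ]
    attended = cross_attend(queries, tokens)
    return [[sum(w * a for w, a in zip(row, vec)) for row in proj] for vec in attended]

def rand_vec(rng, n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

rng = random.Random(0)
T, P, D, Q, D_LLM = 4, 3, 8, 2, 16  # frames, features per frame, feature dim, queries, LLM dim
frame_feats = [[rand_vec(rng, D) for _ in range(P)] for _ in range(T)]
frame_pos = [rand_vec(rng, D) for _ in range(T)]
queries = [rand_vec(rng, D) for _ in range(Q)]
proj = [rand_vec(rng, D) for _ in range(D_LLM)]

video_tokens = video_tokens_for_llm(frame_feats, frame_pos, queries, proj)
print(len(video_tokens), len(video_tokens[0]))  # 2 16
```

The point of the sketch is that the number of vectors handed to the LLM is fixed by the number of queries, not by the number of frames, so a longer clip does not lengthen the LLM's visual prefix. An audio branch in the same spirit would replace the frame features with audio-segment features.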

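Comments:	Accepted by the EMNLP 2023 demo track; code, pretrained models, and dataset: https://github.com/DAMO-NLP-SG/Video-LLaMA
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2306.02858 [cs.CL]
	(or arXiv:2306.02858v4 [cs.CL] for this version)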
	https://doi.org/10.48550/arXiv.2306.02858

Submission history

From: Hang Zhang
[v1] Mon, 5 Jun 2023 13:17:27 UTC (2,864 KB)
[v2] Tue, 6 Jun 2023 12:28:37 UTC (2,863 KB)
[v3] Mon, 12 Jun 2023 02:28:57 UTC (2,864 KB)
[v4] Wed, 25 Oct 2023 06:23:31 UTC (2,870 KB)
