Frame Sampling Strategies Matter: A Benchmark for small vision language models

Brkic, Marija; Razzouki, Anas Filali; Tevissen, Yannis; Guetari, Khalil; Yacoubi, Mounim A. El

arXiv:2509.14769v1 (cs)

[Submitted on 18 Sep 2025]

Title: Frame Sampling Strategies Matter: A Benchmark for small vision language models

Authors: Marija Brkic, Anas Filali Razzouki, Yannis Tevissen, Khalil Guetari, Mounim A. El Yacoubi

Abstract: Comparing vision language models (VLMs) on videos is particularly complex, as performance is jointly determined by the model's visual representation capacity and the frame-sampling strategy used to construct the input. Current video benchmarks are suspected to suffer from substantial frame-sampling bias, as models are evaluated with different frame selection strategies. In this work, we propose the first frame-accurate benchmark of state-of-the-art small VLMs (SVLMs) for video question-answering, evaluated under controlled frame-sampling strategies. Our results confirm the suspected bias and highlight both data-specific and task-specific behaviors of SVLMs under different frame-sampling techniques. By open-sourcing our benchmarking code, we provide the community with a reproducible and unbiased protocol for evaluating video VLMs and emphasize the need for standardized frame-sampling strategies tailored to each benchmarking dataset in future research.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2509.14769 [cs.CV]
	(or arXiv:2509.14769v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2509.14769

Submission history

From: Yannis Tevissen
[v1] Thu, 18 Sep 2025 09:18:42 UTC (44 KB)
