Inferring Dynamic Physical Properties from Video Foundation Models

Zhan, Guanqi; Ma, Xianzheng; Xie, Weidi; Zisserman, Andrew

arXiv:2510.02311 (cs)

[Submitted on 2 Oct 2025]

Title: Inferring Dynamic Physical Properties from Video Foundation Models

Authors: Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman

Abstract: We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real-world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple readout mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve a similar performance, though behind that of the oracle, and MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.
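
As a rough illustration of why a property such as elasticity needs temporal information, the sketch below estimates a coefficient of restitution from the successive peak heights of a bouncing object, using the textbook relation e = sqrt(h_next / h_prev). This is not the authors' oracle method, which the abstract does not specify. The function name `restitution_from_peaks` and the assumption that peak heights have already been extracted from the video (for example by tracking the object) are hypothetical.

```python
import math


def restitution_from_peaks(peak_heights):
    """Estimate a coefficient of restitution from successive bounce peaks.

    Assumption (not from the paper): peak_heights lists the maximum height
    reached after each bounce, in any consistent unit, in temporal order.
    Uses the standard relation e = sqrt(h_next / h_prev) for each pair of
    consecutive peaks and returns the mean over all pairs.
    """
    if len(peak_heights) < 2:
        raise ValueError("need at least two peak heights")
    ratios = []
    for h_prev, h_next in zip(peak_heights, peak_heights[1:]):
        if h_prev <= 0 or h_next < 0:
            raise ValueError("heights must be positive (h_prev) and non-negative (h_next)")
        ratios.append(math.sqrt(h_next / h_prev))
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Illustrative heights only, not data from the paper.
    print(restitution_from_peaks([1.0, 0.64, 0.4096]))
```

A single frame gives only one height, so the ratio between consecutive peaks (and hence the estimate) is available only when several frames are observed over time.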

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2510.02311 [cs.CV]
	(or arXiv:2510.02311v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2510.02311

Submission history

From: Guanqi Zhan
[v1] Thu, 2 Oct 2025 17:59:50 UTC (10,750 KB)
