Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning

Vlachos, Angelos; Filandrianos, Giorgos; Lymperaiou, Maria; Spanos, Nikolaos; Mitsouras, Ilias; Karampinis, Vasileios; Voulodimos, Athanasios

arXiv:2508.00356 (cs)

[Submitted on 1 Aug 2025]

Authors: Angelos Vlachos, Giorgos Filandrianos, Maria Lymperaiou, Nikolaos Spanos, Ilias Mitsouras, Vasileios Karampinis, Athanasios Voulodimos

Abstract: We present a collaborative agent-based framework for multi-image reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats by employing a dual-agent system: a language-based PromptEngineer, which generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) responsible for final inference. The framework is fully automated, modular, and training-free, enabling generalization across classification, question answering, and free-form generation tasks involving one or multiple input images. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge (Track A), covering a broad spectrum of visual reasoning tasks including document QA, visual comparison, dialogue-based understanding, and scene-level inference. Our results demonstrate that LVLMs can effectively reason over multiple images when guided by informative prompts. Notably, Claude 3.7 achieves near-ceiling performance on challenging tasks such as TQA (99.13% accuracy), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L). We also explore how design choices, such as model selection, shot count, and input length, influence the reasoning performance of different LVLMs.
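The dual-agent flow described in the abstract can be pictured with a minimal Python sketch. It is illustrative only and not the authors' implementation: the `Example` fields, the two callable signatures, and the stub agents are all assumptions made for this example, and a real system would call a language model and an LVLM where the stubs appear.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical signatures; neither is taken from the paper.
# A prompt engineer maps (task description, question) to a prompt string.
PromptEngineer = Callable[[str, str], str]
# A vision reasoner maps (prompt, images) to an answer string.
VisionReasoner = Callable[[str, Sequence[bytes]], str]


@dataclass
class Example:
    """One evaluation item: a task description, a question, and its images."""
    task_description: str
    question: str
    images: Sequence[bytes]


def run_pipeline(example: Example,
                 engineer: PromptEngineer,
                 reasoner: VisionReasoner) -> str:
    """Stage 1: build a task-specific prompt. Stage 2: reason over the images."""
    prompt = engineer(example.task_description, example.question)
    return reasoner(prompt, example.images)


def stub_engineer(task_description: str, question: str) -> str:
    """Stand-in for a language-model call that writes the prompt."""
    return f"Task: {task_description}\nQuestion: {question}\nAnswer concisely."


def stub_reasoner(prompt: str, images: Sequence[bytes]) -> str:
    """Stand-in for an LVLM call; it only reports what it received."""
    return f"[{len(images)} image(s)] prompt of {len(prompt)} characters"


if __name__ == "__main__":
    item = Example("document QA", "What is the total?", [b"img-a", b"img-b"])
    print(run_pipeline(item, stub_engineer, stub_reasoner))
```

Keeping the two agents behind plain callables reflects the modular, training-free framing in the abstract: either stage can be swapped for a different model without touching the other.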

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
ACM classes:	I.2; I.2.7
Cite as:	arXiv:2508.00356 [cs.CV]
	(or arXiv:2508.00356v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2508.00356

Submission history

From: Maria Lymperaiou
[v1] Fri, 1 Aug 2025 06:39:15 UTC (7,033 KB)
