LLM-I: LLMs are Naturally Interleaved Multimodal Creators

Guo, Zirun; Zhang, Feng; Jia, Kai; Jin, Tao


arXiv:2509.13642 (cs)

[Submitted on 17 Sep 2025]


Abstract: We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.
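The tool-use framing and the hybrid reward described in the abstract can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its repository: the `<tool>…</tool>` tag format, the JSON fields `source` and `prompt`, the stub tools that return placeholder strings, and the equal weighting of the three reward terms.

```python
import json
import re
from typing import Callable, Dict

# Hypothetical tag format: the abstract does not say how tool calls are encoded.
TOOL_CALL = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

# Stub tools standing in for the four tool families named in the abstract.
# Real tools would return images; these return placeholder strings.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"[image:search:{q}]",
    "diffusion": lambda q: f"[image:diffusion:{q}]",
    "code": lambda q: f"[image:code:{q}]",
    "edit": lambda q: f"[image:edit:{q}]",
}


def render(agent_output: str) -> str:
    """Replace each tool-call tag with the placeholder returned by its tool."""
    def dispatch(match: re.Match) -> str:
        call = json.loads(match.group(1))
        return TOOLS[call["source"]](call["prompt"])
    return TOOL_CALL.sub(dispatch, agent_output)


def hybrid_reward(agent_output: str, expected_images: int,
                  llm_score: float, mllm_score: float) -> float:
    """Average a rule-based check with two judge scores in [0, 1].

    The rule (tool-call count matches the requested image count) and the
    equal weighting are illustrative assumptions only.
    """
    n_calls = len(TOOL_CALL.findall(agent_output))
    rule_score = 1.0 if n_calls == expected_images else 0.0
    return (rule_score + llm_score + mllm_score) / 3.0


text = 'Paris skyline: <tool>{"source": "search", "prompt": "Eiffel Tower"}</tool>'
print(render(text))
print(hybrid_reward(text, expected_images=1, llm_score=0.5, mllm_score=1.0))
```

In this sketch the language model only emits text; images enter the output when the dispatcher resolves each tag, which is one way to read "interleaved generation as a tool-use problem".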

Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2509.13642 [cs.LG]
	(or arXiv:2509.13642v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2509.13642

Submission history

From: Zirun Guo
[v1] Wed, 17 Sep 2025 02:33:29 UTC (18,305 KB)
