Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

Jolicoeur-Martineau, Alexia

Computer Science > Artificial Intelligence

arXiv:2508.00632 (cs)

[Submitted on 1 Aug 2025]

Title: Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

Authors: Alexia Jolicoeur-Martineau

Abstract: While AI excels at generating text, audio, images, and videos, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but lack automated evaluation metrics and struggle with complex content that normally requires teams of humans working for many months (multi-shot, multi-agents) using assets made by artists. To tackle these issues, we built a new metric and a multi-agent system. We propose AVR-Eval, a relative metric for multimedia content quality using Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two contents, with a text model reviewing evaluations to determine superiority. We show that AVR-Eval properly identifies good from broken or mismatched content. We built AVR-Agent, a multi-agent system generating JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial codes, uses AVR-Eval to identify the best version, and iteratively improves it through omni-modal agent feedback from the AVR. We run experiments on games and animations with AVR-Eval (win rate of content A against B). We find that content generated by AVR-Agent has a significantly higher win rate against content made through one-shot generation. However, models struggle to leverage custom assets and AVR feedback effectively, showing no higher win rate. This reveals a critical gap: while humans benefit from high-quality assets and audio-visual feedback, current coding models do not seem to utilize these resources as effectively, highlighting fundamental differences between human and machine content creation approaches.
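
The abstract describes two uses of a pairwise comparison: picking the best of several initial candidates, and reporting a win rate of content A against content B. The sketch below illustrates both under stated assumptions. The abstract does not say how the best candidate is derived from pairwise judgments, so the sequential knockout in `select_best` is an assumption, as are all names here (`select_best`, `win_rate`, `toy_judge`). The judge is a toy stand-in, not AVR-Eval itself.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical judge signature: returns True when the first item is preferred.
# In the paper this role is played by AVR-Eval; here it is just a callable.
Judge = Callable[[T, T], bool]


def select_best(candidates: Sequence[T], prefers_first: Judge) -> T:
    """Sequential knockout (an assumed scheme): keep a running champion and
    replace it whenever the judge prefers the challenger."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefers_first(challenger, best):
            best = challenger
    return best


def win_rate(contents_a: Sequence[T], contents_b: Sequence[T],
             prefers_first: Judge) -> float:
    """Fraction of paired comparisons in which the A item beats the B item."""
    pairs = list(zip(contents_a, contents_b))
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for a, b in pairs if prefers_first(a, b))
    return wins / len(pairs)


def toy_judge(x: str, y: str) -> bool:
    """Toy stand-in for a real judge: prefers the longer string."""
    return len(x) > len(y)
```

With the toy judge, `select_best(["ab", "abcd", "a"], toy_judge)` keeps the longest string, and `win_rate` over two equal-length lists reports how often the first list's item was preferred.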

Subjects:	Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Cite as:	arXiv:2508.00632 [cs.AI]
	(or arXiv:2508.00632v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2508.00632

Submission history

From: Alexia Jolicoeur-Martineau
[v1] Fri, 1 Aug 2025 13:45:13 UTC (5,306 KB)
