A Compute-Matched Re-Evaluation of TroVE on MATH

Sesterhenn, Tobias; Berlot-Attwell, Ian; Zenkner, Janis; Bartelt, Christian

arXiv:2507.22069 (cs)

[Submitted on 16 Jul 2025 (v1), last revised 31 Jul 2025 (this version, v2)]

Authors: Tobias Sesterhenn, Ian Berlot-Attwell, Janis Zenkner, Christian Bartelt

Abstract: Reusing established theorems and formulas is central to mathematical problem solving, serving as essential building blocks for tackling increasingly complex challenges. Recent work, TroVE, argues that code-generating Large Language Models (LLMs) can benefit similarly on the MATH benchmark by inducing and reusing higher-level toolboxes. By allocating computational budget across an ensemble of three modes (directly generating code, creating tools, and reusing tools), TroVE claims to outperform a PRIMITIVE baseline that only performs direct generation. However, recent analysis (Berlot-Attwell et al., 2024) casts doubt on these gains, noting that the tools created are often trivial or rarely reused, suggesting that improvements may stem from self-consistency or self-correction. In this work, we re-evaluate TroVE on MATH, analyze the impact of each of its modes, and show that its benefit does not come from these mechanisms, but simply from the higher computational budget that TroVE spends compared to PRIMITIVE. To this end, we also make a small correction to the original implementation of TroVE's selection mechanism, which raises TroVE's accuracy on MATH by 3%. After matching for compute, the benefit of TroVE reduces to a marginal improvement of 1%, suggesting that this toolbox approach does not provide a significant benefit on MATH.
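The compute-matching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the per-mode sample count `K`, the mode labels, and the helper names are hypothetical, and plain majority voting is used only as a stand-in for self-consistency selection, since the abstract does not describe TroVE's actual selection mechanism. The sketch shows why a three-mode ensemble draws more samples per problem than a single-mode baseline unless the baseline's budget is raised to match.

```python
from collections import Counter

# The three modes named in the abstract; the labels themselves are illustrative.
TROVE_MODES = ("direct", "create_tool", "reuse_tool")


def total_samples(num_modes: int, samples_per_mode: int) -> int:
    """Number of candidate programs sampled per problem."""
    return num_modes * samples_per_mode


def majority_vote(answers):
    """Self-consistency stand-in: most frequent answer, ignoring failed runs (None)."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


K = 5  # hypothetical per-mode sample count, not taken from the paper

trove_budget = total_samples(len(TROVE_MODES), K)   # ensemble of three modes
unmatched_primitive_budget = total_samples(1, K)    # direct generation only
matched_primitive_budget = trove_budget             # baseline given the same budget
```

Under these assumptions the ensemble draws three times as many candidates as the unmatched baseline, so any selection step such as `majority_vote` has more samples to choose from regardless of whether tools are created or reused.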

Subjects:	Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2507.22069 [cs.PL]
	(or arXiv:2507.22069v2 [cs.PL] for this version)
	https://doi.org/10.48550/arXiv.2507.22069

Submission history

From: Tobias Sesterhenn
[v1] Wed, 16 Jul 2025 03:11:43 UTC (96 KB)
[v2] Thu, 31 Jul 2025 07:33:11 UTC (98 KB)