EEFSUVA: A New Mathematical Olympiad Benchmark

Khatibi, Nicole N; Radamovich, Daniil A.; Brenner, Michael P.

arXiv:2510.01227v1 (cs)

[Submitted on 23 Sep 2025]

Title: EEFSUVA: A New Mathematical Olympiad Benchmark

Authors: Nicole N Khatibi, Daniil A. Radamovich, Michael P. Brenner

Abstract: Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, which draw primarily from the International Mathematical Olympiad (IMO) and related competitions, may overstate models' reasoning ability because of potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. These contests feature problems comparable in difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.

Comments:	16 pages, 5 figures
Subjects:	Computation and Language (cs.CL); History and Overview (math.HO)
Cite as:	arXiv:2510.01227 [cs.CL]
	(or arXiv:2510.01227v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2510.01227

Submission history

From: Nicole Khatibi
[v1] Tue, 23 Sep 2025 01:57:56 UTC (20 KB)
