Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach

Chen, Rubing; Wu, Jiaxin; Wang, Jian; Zhang, Xulu; Fan, Wenqi; Lin, Chenghua; Wei, Xiao-Yong; Li, Qing

arXiv:2508.07353 (cs)

[Submitted on 10 Aug 2025]

Authors: Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li

Abstract: Numerous benchmarks have been built to evaluate the domain-specific abilities of large language models (LLMs), highlighting the need for effective and efficient benchmark construction. Existing domain-specific benchmarks primarily focus on the scaling law, relying on massive corpora for supervised fine-tuning or generating extensive question sets for broad coverage. However, the impact of corpus and question-answer (QA) set design on the precision and recall of domain-specific LLMs remains unexplored. In this paper, we address this gap and demonstrate that the scaling law is not always the optimal principle for benchmark construction in specific domains. Instead, we propose Comp-Comp, an iterative benchmarking framework based on a comprehensiveness-compactness principle. Here, comprehensiveness ensures semantic recall of the domain, while compactness enhances precision, guiding both corpus and QA set construction. To validate our framework, we conducted a case study at a renowned university, resulting in the creation of XUBench, a large-scale and comprehensive closed-domain benchmark. Although we use the academic domain as the case in this work, our Comp-Comp framework is designed to be extensible beyond academia, providing valuable insights for benchmark construction across various domains.

Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2508.07353 [cs.AI]
	(or arXiv:2508.07353v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2508.07353

Submission history

From: Rubing Chen
[v1] Sun, 10 Aug 2025 14:08:28 UTC (5,640 KB)