MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Yuan, Qianhao; Lou, Jie; Li, Zichao; Chen, Jiawei; Lu, Yaojie; Lin, Hongyu; Sun, Le; Zhang, Debing; Han, Xianpei


arXiv:2511.02805 (cs)

[Submitted on 4 Nov 2025]

Abstract: Typical search agents concatenate the entire interaction history into the LLM context. This preserves information integrity but produces long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines it with the current turn. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update the memory so that it retains only the information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes the reasoning, search strategies, and memory management of MemSearcher agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks, with relative average gains of +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher
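
The workflow and the advantage propagation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The callables `act`, `search`, and `update_memory`, the prompt layout, and the use of a population standard deviation in the group normalization are all assumptions made for illustration. The reading that each trajectory consists of several per-turn conversations, each with its own context, is one interpretation of the abstract.

```python
from statistics import mean, pstdev
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces for illustration; none of these names come from the paper's code.
Act = Callable[[str], Tuple[str, str]]              # context -> (kind, text); kind is "search" or "answer"
Search = Callable[[str], str]                       # query -> retrieved text
UpdateMemory = Callable[[str, str, str, str], str]  # (question, memory, query, results) -> new memory
Conversation = Tuple[str, str]                      # (context shown to the model, model output)


def run_episode(
    question: str,
    act: Act,
    search: Search,
    update_memory: UpdateMemory,
    max_turns: int = 8,
) -> Tuple[Optional[str], List[Conversation]]:
    """Run one trajectory, recording one conversation per turn."""
    memory = ""
    conversations: List[Conversation] = []
    for _ in range(max_turns):
        # The context holds only the question and the compact memory,
        # never the full interaction history (assumed prompt layout).
        context = f"Question: {question}\nMemory: {memory}"
        kind, text = act(context)
        conversations.append((context, f"{kind}: {text}"))
        if kind == "answer":
            return text, conversations
        results = search(text)
        # The memory is rewritten each turn instead of being appended to.
        memory = update_memory(question, memory, text, results)
    return None, conversations


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps), population std assumed."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def propagate(
    trajectories: List[List[Conversation]], rewards: List[float]
) -> List[Tuple[Conversation, float]]:
    """Copy each trajectory-level advantage to every conversation in that trajectory."""
    samples: List[Tuple[Conversation, float]] = []
    for conversations, adv in zip(trajectories, group_advantages(rewards)):
        samples.extend((conv, adv) for conv in conversations)
    return samples
```

In this sketch the context grows only as much as the memory does, so its length stays bounded as long as `update_memory` returns a compact summary. Every conversation in a trajectory shares the same advantage, which is how a single outcome reward could reach the memory-update and search decisions made in earlier turns.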

Comments:	Project page: https://github.com/icip-cas/MemSearcher
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2511.02805 [cs.CL]
	(or arXiv:2511.02805v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2511.02805

Submission history

From: Qianhao Yuan
[v1] Tue, 4 Nov 2025 18:27:39 UTC (1,413 KB)
