Efficient RLHF: Reducing the Memory Usage of PPO

Santacroce, Michael; Lu, Yadong; Yu, Han; Li, Yuanzhi; Shen, Yelong

arXiv:2309.00754v1 (cs)

[Submitted on 1 Sep 2023]

Authors: Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen

Abstract: Reinforcement Learning with Human Feedback (RLHF) has revolutionized language modeling by aligning models with human preferences. However, the RL stage, Proximal Policy Optimization (PPO), requires over 3x the memory of Supervised Fine-Tuning (SFT), making it infeasible for most practitioners. To address this issue, we present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO. We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training. Our experiments show: 1. Using LoRA during PPO reduces its memory usage to below that of SFT while improving alignment across four public benchmarks, and 2. Hydra-PPO reduces the latency per sample of LoRA-PPO by up to 65% while maintaining its performance. Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread usage of RLHF.
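The abstract mentions turning LoRA "off" during training but does not spell out the mechanism. As a rough illustration only, the sketch below shows one way a low-rank adapter could be toggled so that a single set of frozen base weights serves both the adapted model and the model it was adapted from. The class `LoRALinear`, the helper `lora_disabled`, and all numeric values are hypothetical and are not taken from the paper.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


@dataclass
class LoRALinear:
    """Toy linear layer: frozen weight W plus a low-rank update B @ A (hypothetical)."""
    weight: Matrix   # frozen base weight, shape (out, in)
    lora_a: Matrix   # trainable down-projection, shape (rank, in)
    lora_b: Matrix   # trainable up-projection, shape (out, rank)
    scale: float = 1.0
    lora_enabled: bool = True

    def forward(self, x: List[float]) -> List[float]:
        y = matvec(self.weight, x)
        if self.lora_enabled:
            # Add the low-rank correction scale * B(Ax) to the base output.
            delta = matvec(self.lora_b, matvec(self.lora_a, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y


@contextmanager
def lora_disabled(layer: LoRALinear):
    """Temporarily bypass the adapter, restoring the previous state on exit."""
    previous = layer.lora_enabled
    layer.lora_enabled = False
    try:
        yield layer
    finally:
        layer.lora_enabled = previous


# Illustrative values only.
layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [0.0]],
)
x = [2.0, 3.0]

adapted_out = layer.forward(x)       # base weights plus LoRA update
with lora_disabled(layer):
    base_out = layer.forward(x)      # base weights only, no second model copy
```

Because the adapter adds only the small matrices `lora_a` and `lora_b`, the base-model output is available without keeping a second full copy of the weights in memory, which is the kind of saving the abstract is concerned with.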

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2309.00754 [cs.LG]
	(or arXiv:2309.00754v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2309.00754

Submission history

From: Michael Santacroce
[v1] Fri, 1 Sep 2023 22:57:20 UTC (200 KB)
