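NeurIPS 2022 The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games [1]: Proposed and open-sourced MAPPO, a parallel reinforcement learning training framework for multi-agent settings that supports multi-agent training in cooperative scenarios. The work has been widely adopted across the multi-agent field, and the paper has now been cited more than 1,000 times.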
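ICLR 2024 Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores [2]: Proposed a distributed training framework for reinforcement learning that scales readily to more than ten thousand cores, with a speedup exceeding that of Rapid, OpenAI's large-scale reinforcement learning system.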
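ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation [3]: More recently, Yi Wu's team went further and built ReaLHF, a distributed RLHF training framework. The team's ICML Oral paper was produced on top of the ReaLHF system. ReaLHF went through a long development period and extensive polishing of details to reach optimal performance. Compared with earlier open-source work, ReaLHF achieves near-linear scalability in RLHF, a setting more complex than pretraining, while also delivering higher resource utilization. It runs RLHF training stably and quickly even on 128 A100 GPUs. The work has been open-sourced: https://github.com/openpsi-project/ReaLHF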
[1] Yu, Chao, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games."
[2] Mei, Zhiyu, Wei Fu, Guangju Wang, Huanchen Zhang, and Yi Wu. "SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores."
[3] Mei, Zhiyu, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. "ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation."
[4] Xu, Shusheng, Huaijie Wang, Jiaxuan Gao, Yutao Ouyang, Chao Yu, and Yi Wu. "Language-Guided Generation of Physically Realistic Robot Motion and Control."
[5] Xu, Zelai, Chao Yu, Fei Fang, Yu Wang, and Yi Wu. "Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game."
[6] Liu, Jijia, Chao Yu, Jiaxuan Gao, Yuqing Xie, Qingmin Liao, Yi Wu, and Yu Wang. "LLM-Powered Hierarchical Language Agent for Real-Time Human-AI Coordination."
[7] Ouyang, Yutao, Jinhan Li, Yunfei Li, Zhongyu Li, Chao Yu, Koushil Sreenath, and Yi Wu. "Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Models."