no code implementations • 19 Oct 2024 • Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weilin Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, Yi Wu
In this work, we evaluate popular reward models for RL training, including the Outcome-supervised Reward Model (ORM) and the Process-supervised Reward Model (PRM), and train a collection of LLMs for math problems using RL by combining these learned rewards with success rewards.
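As a rough illustration of the idea of combining a learned reward with a rule-based success reward for RL training, here is a minimal PyTorch sketch; the model interface, the weighting coefficients, and all names are assumptions for illustration, not the paper's implementation.

```python
import torch

def combined_reward(reward_model, tokenizer, prompt, response, answer_is_correct,
                    alpha=1.0, beta=1.0):
    """Blend a learned reward (e.g., an ORM score) with a binary success reward.

    Assumes `reward_model` is a sequence-classification-style scorer returning a
    single logit for the (prompt, response) pair; the linear weighting here is
    illustrative only.
    """
    inputs = tokenizer(prompt + response, return_tensors="pt")
    with torch.no_grad():
        learned_score = reward_model(**inputs).logits.squeeze().item()
    success = 1.0 if answer_is_correct else 0.0
    # The combined scalar would be used as the per-response reward by the RL
    # algorithm (e.g., PPO).
    return alpha * learned_score + beta * success
```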
1 code implementation • 20 Jun 2024 • Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, Yi Wu
To overcome this limitation, we propose a novel technique named parameter ReaLlocation, which dynamically adapts the parallelization strategies for different workloads during training by redistributing LLM parameters across the training cluster.
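A toy sketch of redistributing parameters across a varying number of workers for different workloads is shown below; this is not the paper's actual mechanism (real systems re-shard tensors and optimizer state and rebuild communication groups), and the shard layout and worker counts are purely illustrative.

```python
from typing import Dict, List

def reshard(param_ids: List[int], num_workers: int) -> Dict[int, List[int]]:
    """Assign contiguous blocks of parameter ids to workers (toy illustration)."""
    shard = (len(param_ids) + num_workers - 1) // num_workers
    return {w: param_ids[w * shard:(w + 1) * shard] for w in range(num_workers)}

# Different phases of an RLHF iteration (e.g., generation vs. gradient updates)
# may prefer different parallelization layouts, so the same parameters can be
# redistributed across a different number of workers per phase.
generation_layout = reshard(list(range(64)), num_workers=4)
training_layout = reshard(list(range(64)), num_workers=8)
```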
1 code implementation • 16 Apr 2024 • Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu
However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO).
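For reference, the core DPO objective can be sketched in a few lines of PyTorch; the function below assumes the summed log-probabilities of each full response under the policy and a frozen reference model have already been computed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the implicit reward of the chosen response
    above that of the rejected one, without training an explicit reward model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with a dummy batch of two preference pairs.
pol_c = torch.tensor([-12.0, -9.5]); pol_r = torch.tensor([-14.0, -11.0])
ref_c = torch.tensor([-12.5, -10.0]); ref_r = torch.tensor([-13.5, -10.5])
loss = dpo_loss(pol_c, pol_r, ref_c, ref_r)
```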
2 code implementations • 29 Jun 2023 • Zhiyu Mei, Wei Fu, Jiaxuan Gao, Guangju Wang, Huanchen Zhang, Yi Wu
Following this abstraction, we develop a scalable, efficient, and extensible distributed RL system called ReaLlyScalableRL, which enables massively parallelized training and easy development of customized algorithms.
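As a loose illustration of how a worker-based abstraction can make custom algorithms easy to plug in, consider the sketch below; this is not ReaLlyScalableRL's actual API, and the class and method names are hypothetical.

```python
from abc import ABC, abstractmethod

class TrainerWorker(ABC):
    """Hypothetical worker interface: the system schedules many such workers in
    parallel, and a new algorithm only needs to override the training step."""

    @abstractmethod
    def step(self, batch):
        """Consume one batch of rollout data and return training statistics."""

class PPOTrainerWorker(TrainerWorker):
    def step(self, batch):
        # A real implementation would compute the clipped PPO objective here.
        return {"loss": 0.0, "num_samples": len(batch)}
```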