1 code implementation • 10 Feb 2025 • Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen
To alleviate the long-standing difficulties caused by sparse rewards in RL, which are compounded by the partial correctness of long chains of thought in reasoning tasks, we further apply a token-level reward model to sample important tokens in reasoning trajectories for learning.
1 code implementation • 5 Jul 2022 • Weihan Cao, Yifan Zhang, Jianfei Gao, Anda Cheng, Ke Cheng, Jian Cheng
First, the difference in feature magnitude between the teacher and the student could enforce overly strict constraints on the student.
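The magnitude issue can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the authors' method: the toy feature vectors, the helper names `mse` and `standardize`, and the choice of zero-mean, unit-variance standardization are all hypothetical. The student's features follow exactly the same pattern as the teacher's at half the scale, yet a plain mean-squared-error imitation loss still penalizes the student heavily.

```python
from statistics import fmean, pstdev

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return fmean([(x - y) ** 2 for x, y in zip(a, b)])

def standardize(v):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = fmean(v), pstdev(v)
    return [(x - mu) / sigma for x in v]

# Hypothetical toy features: the student matches the teacher's pattern
# but at half the magnitude.
teacher = [2.0, 4.0, 6.0, 8.0]
student = [0.5 * x for x in teacher]

# Raw MSE punishes the scale gap even though the relative pattern is identical.
raw_loss = mse(teacher, student)

# After standardization (an assumption chosen for illustration),
# the scale gap no longer contributes to the loss.
norm_loss = mse(standardize(teacher), standardize(student))

print(f"raw MSE: {raw_loss:.4f}")
print(f"MSE after standardization: {norm_loss:.2e}")
```

In this toy case the raw loss is driven entirely by scale, which is one way to read the "overly strict constraints" described above: the student is asked to reproduce the teacher's magnitudes, not just its relational structure.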