Search Results for author: Yilin Bao

Found 1 paper, 1 paper with code

Offline Reinforcement Learning for LLM Multi-Step Reasoning

2 code implementations · 20 Dec 2024 · Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, Yi Wu

While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for such tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning, where rewards are often sparse.
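For context, a minimal sketch of the standard DPO objective (Rafailov et al., 2023) illustrates both limitations: the loss is defined only over (chosen, rejected) response pairs, and each response is scored by summing per-token log-probabilities, so every token contributes uniformly with no per-step credit assignment. The function and argument names below are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of (chosen, rejected) pairs.

    Each *_logps tensor holds per-token log-probabilities of one
    response, shape (batch, seq_len), with padding positions zeroed.
    """
    # (1) Paired data is baked in: every prompt needs both a chosen
    #     AND a rejected response for the loss to be defined.
    # (2) Uniform token treatment: summing over the sequence weights
    #     all tokens equally, so no single step gets finer credit.
    pi_logratio = policy_chosen_logps.sum(-1) - policy_rejected_logps.sum(-1)
    ref_logratio = ref_chosen_logps.sum(-1) - ref_rejected_logps.sum(-1)
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()
```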

GSM8K · Math · +5 more
