1 code implementation • 17 Jan 2024 • Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li
ReFT first warmups the model with SFT, and then employs on-line reinforcement learning, specifically the PPO algorithm in this paper, to further fine-tune the model, where an abundance of reasoning paths are automatically sampled given the question and the rewards are naturally derived from the ground-truth answers.
1 code implementation • 20 Sep 2023 • Zhanming Jie, Trung Quoc Luong, Xinbo Zhang, Xiaoran Jin, Hang Li
We also find that Python is a better choice of language than Wolfram for program CoTs.
no code implementations • 16 May 2023 • Shuwei Feng, Tianyang Zhan, Zhanming Jie, Trung Quoc Luong, Xiaoran Jin
This paper presents GenDoc, a general sequence-to-sequence document understanding model pre-trained with unified masking across three modalities: text, image, and layout.