1 code implementation • 29 May 2024 • Shenao Zhang, Donghan Yu, Hiteshi Sharma, ZiYi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang
Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions.
no code implementations • 26 May 2024 • Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model; one that simultaneously minimizes the maximum likelihood estimation of the loss and a reward penalty term.
1 code implementation • 25 Feb 2024 • Shenao Zhang, Sirui Zheng, Shuqi Ke, Zhihan Liu, Wanxin Jin, Jianbo Yuan, Yingxiang Yang, Hongxia Yang, Zhaoran Wang
Specifically, we develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning, particularly when the difference between the ideal policy and the LLM-informed policy is small, which suggests that the initial policy is close to optimal, reducing the need for further exploration.
1 code implementation • 29 Sep 2023 • Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, Zhaoran Wang
Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon ("reason for future").
1 code implementation • NeurIPS 2023 • Zhihan Liu, Miao Lu, Wei Xiong, Han Zhong, Hao Hu, Shenao Zhang, Sirui Zheng, Zhuoran Yang, Zhaoran Wang
To achieve this, existing sample-efficient online RL algorithms typically consist of three components: estimation, planning, and exploration.
no code implementations • 25 May 2023 • Xiaoyu Chen, Shenao Zhang, Pushi Zhang, Li Zhao, Jianyu Chen
With strong capabilities of reasoning and a broad understanding of the world, Large Language Models (LLMs) have demonstrated immense potential in building versatile embodied decision-making agents capable of executing a wide array of tasks.
no code implementations • 16 Sep 2022 • Shenao Zhang
In this work, we propose Conservative Dual Policy Optimization (CDPO) that involves a Referential Update and a Conservative Update.
Model-based Reinforcement Learning reinforcement-learning +1
no code implementations • 30 Aug 2021 • Shenao Zhang, Lei Han, Li Shen
In multi-agent reinforcement learning, the behaviors that agents learn in a single Markov Game (MG) are typically confined to the given agent number.
Multi-agent Reinforcement Learning reinforcement-learning +1
1 code implementation • 12 Jun 2021 • Shenao Zhang, Li Shen, Zhifeng Li, Wei Liu
Capturing contextual dependencies has proven useful to improve the representational power of deep neural networks.