1 code implementation • 16 Jun 2024 • Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang, Yuanshun Yao, Xiaojun Xu, Zhaoran Wang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Hang Li, Yang Liu
We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents.
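The convergence claim can be illustrated on a toy game. In the sketch below, fictitious play (a simple iterative best-response scheme, used here only as an illustration and not as the paper's algorithm) drives both players' empirical strategies in matching pennies toward the mixed Nash equilibrium (0.5, 0.5).

```python
# Toy illustration of iterative best responses converging to a Nash
# equilibrium (fictitious play on matching pennies); not the paper's method.
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # row player's payoffs; column player gets -A
row_counts = np.array([1.0, 0.0])          # empirical action counts for each player
col_counts = np.array([1.0, 0.0])

for _ in range(10_000):
    col_mix = col_counts / col_counts.sum()
    row_counts[np.argmax(A @ col_mix)] += 1        # row best-responds to column's mixture
    row_mix = row_counts / row_counts.sum()
    col_counts[np.argmax(-(row_mix @ A))] += 1     # column best-responds to row's mixture

print(row_counts / row_counts.sum(), col_counts / col_counts.sum())  # both approach [0.5, 0.5]
```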
no code implementations • 26 May 2024 • Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model: one that simultaneously minimizes the maximum-likelihood estimation loss and a reward penalty term.
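A rough sense of that maximin recipe can be given on a toy bandit: the reward model is fit by minimizing a Bradley-Terry likelihood loss plus a penalty on the reward it assigns to the current policy, and the policy then best-responds to that pessimistic reward. This is an illustrative sketch under those assumptions, not the paper's algorithm; names such as `lambda_pen` are made up.

```python
# Illustrative maximin-style sketch on a 3-action bandit (not the paper's algorithm).
import torch

prefs = [(0, 1), (0, 2), (1, 2)]          # (preferred action, rejected action) pairs
r = torch.zeros(3, requires_grad=True)    # reward model: one scalar per action
policy_logits = torch.zeros(3, requires_grad=True)
opt_r = torch.optim.Adam([r], lr=0.05)
opt_pi = torch.optim.Adam([policy_logits], lr=0.05)
lambda_pen = 0.1                           # assumed penalty weight

for _ in range(500):
    pi = torch.softmax(policy_logits, dim=0)

    # Inner step: reward minimizes the Bradley-Terry loss plus a penalty on the policy's reward.
    bt_loss = -sum(torch.log(torch.sigmoid(r[w] - r[l])) for w, l in prefs)
    reward_loss = bt_loss + lambda_pen * (pi.detach() * r).sum()
    opt_r.zero_grad()
    reward_loss.backward()
    opt_r.step()

    # Outer step: policy maximizes its expected reward under the adversarially chosen r.
    pi = torch.softmax(policy_logits, dim=0)
    policy_loss = -(pi * r.detach()).sum()
    opt_pi.zero_grad()
    policy_loss.backward()
    opt_pi.step()

print(torch.softmax(policy_logits, dim=0))  # policy concentrated on the preferred action
```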
no code implementations • 9 Apr 2024 • Xudong Yu, Chenjia Bai, Hongyi Guo, Changhong Wang, Zhen Wang
Offline Reinforcement Learning (RL) faces distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions.
no code implementations • 12 Mar 2024 • Wei Shen, Xiaoying Zhang, Yuanshun Yao, Rui Zheng, Hongyi Guo, Yang Liu
Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences.
no code implementations • 8 Mar 2024 • Hongyi Guo, Zhihan Liu, Yufeng Zhang, Zhaoran Wang
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
no code implementations • 16 Feb 2024 • Jiaheng Wei, Yuanshun Yao, Jean-Francois Ton, Hongyi Guo, Andrew Estornell, Yang Liu
FEWL leverages the answers from off-the-shelf LLMs, which serve as a proxy for gold-standard answers.
no code implementations • 6 Jan 2024 • Hongyi Guo, Yuanshun Yao, Wei Shen, Jiaheng Wei, Xiaoying Zhang, Zhaoran Wang, Yang Liu
The key idea is to first retrieve high-quality samples related to the target domain and use them as in-context learning examples to generate more samples.
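A minimal sketch of that retrieve-then-generate loop is below, assuming a sentence-embedding retriever and an unspecified LLM completion call; `call_llm` is a placeholder and the prompt wording is illustrative rather than the paper's.

```python
# Sketch: retrieve target-domain-like samples, then use them as in-context
# examples when asking an LLM to generate more samples.
import numpy as np
from sentence_transformers import SentenceTransformer

candidate_pool = [
    "Customer: my package is late. Agent: I'm sorry, let me check the tracking.",
    "User review: the blender broke after two uses.",
    "Customer: how do I reset my password? Agent: use 'forgot password' on the login page.",
]

def retrieve_top_k(query: str, pool: list[str], k: int = 2) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([query] + pool, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]                        # cosine similarity to the query
    return [pool[i] for i in np.argsort(-sims)[:k]]

def build_generation_prompt(examples: list[str], n_new: int = 5) -> str:
    shots = "\n".join(f"Example {i + 1}: {ex}" for i, ex in enumerate(examples))
    return (
        "High-quality samples from the target domain:\n"
        f"{shots}\n\n"
        f"Write {n_new} new samples in the same style and domain."
    )

examples = retrieve_top_k("customer support dialogue", candidate_pool)
prompt = build_generation_prompt(examples)
# new_samples = call_llm(prompt)   # placeholder for any LLM completion API
```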
1 code implementation • 29 Sep 2023 • Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, Zhaoran Wang
Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon ("reason for future").
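The sketch below gives one hypothetical shape such a template could take: the agent conditions on its memory buffer of past interactions and is asked to plan several steps ahead before committing to the next action. The wording and field names are assumptions, not the paper's exact prompt.

```python
# Hypothetical "reason for future" prompt template: plan a multi-step
# trajectory from the memory buffer, then act on the first step now.
def reason_for_future_prompt(goal: str, memory_buffer: list[dict], horizon: int = 5) -> str:
    past = "\n".join(
        f"- state: {m['state']} | action: {m['action']} | feedback: {m['feedback']}"
        for m in memory_buffer
    )
    return (
        f"Goal: {goal}\n"
        "Past interactions stored in the memory buffer:\n"
        f"{past}\n\n"
        f"Reason step by step and plan the next {horizon} actions that make "
        "progress toward the goal. Then output only the first action to execute now."
    )

prompt = reason_for_future_prompt(
    goal="place the apple in the fridge",
    memory_buffer=[{"state": "kitchen", "action": "open fridge", "feedback": "fridge is open"}],
)
print(prompt)
```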
1 code implementation • 8 May 2023 • Rushuai Yang, Chenjia Bai, Hongyi Guo, Siyuan Li, Bin Zhao, Zhen Wang, Peng Liu, Xuelong Li
Under mild assumptions, our objective maximizes the mutual information (MI) between different behaviors based on the same skill, which serves as an upper bound of the previous MI objective.
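As one concrete, generic way to optimize an MI objective of this kind, the sketch below uses an InfoNCE-style contrastive bound in which two behaviors generated by the same skill form a positive pair and behaviors from other skills act as negatives. The encoder, temperature, and batch layout are assumptions for illustration, not the paper's exact objective.

```python
# Generic InfoNCE-style sketch: maximize MI between two behaviors (e.g., states)
# produced by the same skill, with other skills serving as negatives.
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1):
    """z_a[i] and z_b[i] embed two behaviors sampled from the same skill i."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature      # (n_skills, n_skills) similarity matrix
    labels = torch.arange(z_a.size(0))        # positives sit on the diagonal
    return -F.cross_entropy(logits, labels)   # higher value = larger MI bound

encoder = torch.nn.Linear(10, 32)             # stand-in behavior encoder
states_a, states_b = torch.randn(16, 10), torch.randn(16, 10)
loss = -infonce_mi_lower_bound(encoder(states_a), encoder(states_b))
loss.backward()
```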
1 code implementation • NeurIPS 2021 • Jingkang Wang, Hongyi Guo, Zhaowei Zhu, Yang Liu
Most existing policy learning solutions require the learning agents to receive high-quality supervision signals such as well-designed rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC).
2 code implementations • ICML 2020 • Yang Liu, Hongyi Guo
In this work, we introduce a new family of loss functions that we call peer loss functions, which enable learning from noisy labels and do not require a priori specification of the noise rates (sketched below).
Ranked #15 on Learning with noisy labels on CIFAR-100N
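Peer loss admits a simple batch-level sketch: each example pays its usual loss minus the loss of a randomly paired peer prediction evaluated against an independently drawn peer label. The version below uses cross-entropy as the base loss; the `alpha` weight and the permutation-based pairing are common implementation choices shown here as assumptions.

```python
# Peer loss sketch with cross-entropy as the base loss: subtract the loss of
# an independently paired (prediction, label) peer term from the usual loss.
import torch
import torch.nn.functional as F

def peer_cross_entropy(logits: torch.Tensor, noisy_labels: torch.Tensor, alpha: float = 1.0):
    base = F.cross_entropy(logits, noisy_labels)
    # Independently permute predictions and labels to form the peer term.
    pred_perm = torch.randperm(logits.size(0))
    label_perm = torch.randperm(logits.size(0))
    peer = F.cross_entropy(logits[pred_perm], noisy_labels[label_perm])
    return base - alpha * peer

logits = torch.randn(32, 10, requires_grad=True)
noisy_labels = torch.randint(0, 10, (32,))
loss = peer_cross_entropy(logits, noisy_labels)
loss.backward()
```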
no code implementations • 10 Sep 2019 • Liheng Chen, Hongyi Guo, Yali Du, Fei Fang, Haifeng Zhang, Yaoming Zhu, Ming Zhou, Wei-Nan Zhang, Qing Wang, Yong Yu
Although existing works formulate this problem as centralized learning with decentralized execution, which avoids the non-stationarity problem in training, their decentralized execution paradigm limits the agents' capability to coordinate.
Tasks: Multi-agent Reinforcement Learning, Reinforcement Learning