1 code implementation • 6 Oct 2024 • Zhaolin Gao, Wenhao Zhan, Jonathan D. Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun
Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, so low training error need not correspond to good performance once the learner is actually in the conversation loop.
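To make the mismatch concrete, here is a toy sketch (illustrative only; the policies and names are hypothetical, not from the paper) contrasting training-time histories generated by a reference policy with deployment-time histories generated by the learner itself:

```python
# Toy illustration of covariate shift in multi-turn training
# (hypothetical policies; not from the paper).
import random

def ref_policy(history):
    # Fixed reference policy that generated the training conversations.
    return random.choice(["a", "b"])

def learner_policy(history):
    # The learner that will actually be in the conversation loop.
    return "a" if len(history) % 2 == 0 else "b"

# Training time: every previous turn was produced by the reference policy.
train_history = []
for _ in range(3):
    train_history.append(ref_policy(train_history))

# Deployment time: previous turns come from the learner itself, so the
# distribution over histories it conditions on can differ from training.
deploy_history = []
for _ in range(3):
    deploy_history.append(learner_policy(deploy_history))

print("train:", train_history, "deploy:", deploy_history)
```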
no code implementations • 1 Oct 2024 • Wenhao Zhan, Scott Fujimoto, Zheqing Zhu, Jason D. Lee, Daniel R. Jiang, Yonathan Efroni
We study the problem of learning an approximate equilibrium in the offline multi-agent reinforcement learning (MARL) setting.
no code implementations • 18 Jul 2024 • Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D. Lee, Wen Sun, Akshay Krishnamurthy, Dylan J. Foster
Language model alignment methods, such as reinforcement learning from human feedback (RLHF), have led to impressive advances in language model capabilities, but existing techniques are limited by a widely observed phenomenon known as overoptimization, where the quality of the language model plateaus or degrades over the course of the alignment process.
3 code implementations • 25 Apr 2024 • Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models.
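For reference, a minimal sketch of PPO's standard clipped surrogate loss (the textbook objective, not the specific fine-tuning setup of this paper); log-probabilities and advantages are assumed to be precomputed:

```python
# Minimal sketch of PPO's clipped surrogate loss (standard formulation).
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (min) surrogate and negate it for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```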
1 code implementation • 12 Apr 2024 • Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun
Motivated by the fact that the offline preference dataset provides informative states (i.e., states preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset resets: it directly resets the policy optimizer to states in the offline dataset, instead of always starting from the initial state distribution.
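A minimal sketch of the dataset-reset idea in a toy environment (the environment and policy interface here are hypothetical, used only to illustrate resetting rollouts to offline states rather than the initial state distribution):

```python
# Toy sketch of dataset resets (hypothetical environment/policy interface).
import random

class ToyEnv:
    def __init__(self, horizon=5):
        self.horizon = horizon
    def reset(self):            # start from the initial state distribution
        self.t, self.s = 0, 0
        return self.s
    def reset_to(self, state):  # dataset reset: start from an offline state
        self.t, self.s = 0, state
        return self.s
    def step(self, action):
        self.t += 1
        self.s += action
        return self.s, float(self.s), self.t >= self.horizon

def collect_rollout(env, policy, offline_states, reset_prob=0.5):
    # With some probability, reset to an informative state from the offline
    # preference dataset rather than the initial state distribution.
    if offline_states and random.random() < reset_prob:
        state = env.reset_to(random.choice(offline_states))
    else:
        state = env.reset()
    traj, done = [], False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        traj.append((state, action, reward))
    return traj

rollout = collect_rollout(ToyEnv(), lambda s: random.choice([0, 1]),
                          offline_states=[2, 3])
```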
no code implementations • 8 Dec 2023 • Zihan Zhang, Wenhao Zhan, Yuxin Chen, Simon S. Du, Jason D. Lee
Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension d, we propose a novel algorithm that yields an ε-optimal randomized hypothesis with a sample complexity on the order of (d+k)/ε^2 (up to logarithmic factors), matching the best-known lower bound.
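Written out (assuming, as is standard in multi-distribution learning, that k denotes the number of distributions; the excerpt does not define it), the rate is:

```latex
n \;=\; \widetilde{O}\!\left(\frac{d + k}{\varepsilon^{2}}\right)
```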
no code implementations • 20 Nov 2023 • Yulai Zhao, Wenhao Zhan, Xiaoyan Hu, Ho-fung Leung, Farzan Farnia, Wen Sun, Jason D. Lee
We study CVaR RL in low-rank MDPs with nonlinear function approximation.
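For context, a standard definition of the conditional value-at-risk of a return R at level α ∈ (0, 1] (the general Rockafellar-Uryasev form; the paper's exact formulation may differ) is:

```latex
\mathrm{CVaR}_{\alpha}(R) \;=\; \sup_{b \in \mathbb{R}} \left\{ b - \frac{1}{\alpha}\, \mathbb{E}\big[(b - R)_{+}\big] \right\}
```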
no code implementations • 29 May 2023 • Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pairwise preference-based feedback over trajectories, rather than explicit reward signals.
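A minimal sketch of reward learning from pairwise trajectory preferences under a Bradley-Terry model (a common choice in PbRL, though not necessarily the exact model analyzed in the paper):

```python
# Bradley-Terry preference loss for a trajectory-level reward model
# (common PbRL setup; illustrative).
import torch

def preference_nll(reward_model, traj_a, traj_b, prefer_a: bool):
    r_a = reward_model(traj_a).sum()  # cumulative reward of trajectory a
    r_b = reward_model(traj_b).sum()  # cumulative reward of trajectory b
    # Under Bradley-Terry, P(a preferred over b) = sigmoid(r_a - r_b).
    logit = r_a - r_b
    target = torch.tensor(1.0 if prefer_a else 0.0)
    return torch.nn.functional.binary_cross_entropy_with_logits(logit, target)
```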
no code implementations • 24 May 2023 • Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun
Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data, and (2) solve a distributionally robust planning problem over a confidence set around the MLE.
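Schematically (the threshold β and likelihood L are generic placeholders, not the paper's exact constants), step (2) plans pessimistically over a likelihood-based confidence set:

```latex
\mathcal{R}_{\mathrm{conf}} = \Big\{ r \in \mathcal{R} : \mathcal{L}(r) \le \mathcal{L}(\widehat{r}_{\mathrm{MLE}}) + \beta \Big\},
\qquad
\widehat{\pi} = \operatorname*{arg\,max}_{\pi} \; \min_{r \in \mathcal{R}_{\mathrm{conf}}} J(\pi; r)
```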
no code implementations • 12 Jul 2022 • Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee
We show that, given a realizable model class, the sample complexity of learning a near-optimal policy scales only polynomially with the statistical complexity of the model class, without any explicit polynomial dependence on the sizes of the state and observation spaces.
no code implementations • 3 Jun 2022 • Wenhao Zhan, Jason D. Lee, Zhuoran Yang
We study decentralized policy learning in Markov games where we control a single agent that plays against nonstationary and possibly adversarial opponents.
no code implementations • 9 Feb 2022 • Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, Jason D. Lee
Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability).
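For context, all-policy concentrability is the standard requirement that the data distribution μ covers the occupancy measure of every policy (notation schematic; d^π denotes the discounted state-action occupancy of π):

```latex
C_{\mathrm{all}} \;=\; \sup_{\pi} \, \sup_{s, a} \, \frac{d^{\pi}(s, a)}{\mu(s, a)} \; < \; \infty
```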
no code implementations • 24 May 2021 • Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, Yuejie Chi
These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer.
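As a concrete instance, entropy regularization (one common structure-promoting regularizer; the paper treats a general class) augments the discounted value as:

```latex
V_{\tau}^{\pi}(s) \;=\; \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t) + \tau\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \,\Big|\, s_0 = s \right]
```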