Search Results for author: Wenhao Zhan

Found 13 papers, 3 papers with code

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

1 code implementation • 6 Oct 2024 • Zhaolin Gao, Wenhao Zhan, Jonathan D. Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun

Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop.

Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization

no code implementations • 18 Jul 2024 • Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D. Lee, Wen Sun, Akshay Krishnamurthy, Dylan J. Foster

Language model alignment methods, such as reinforcement learning from human feedback (RLHF), have led to impressive advances in language model capabilities, but existing techniques are limited by a widely observed phenomenon known as overoptimization, where the quality of the language model plateaus or degrades over the course of the alignment process.

Language Modelling

REBEL: Reinforcement Learning via Regressing Relative Rewards

3 code implementations • 25 Apr 2024 • Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun

While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models.
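
The excerpt above motivates PPO but does not spell out the proposed update. As a rough illustration of what "regressing relative rewards" could mean, the sketch below regresses the change in policy log-probabilities onto the reward difference for a pair of responses to the same prompt; the function name and the scaling parameter eta are assumptions here, not the paper's notation.

```python
# Hedged sketch: a least-squares reading of "regressing relative rewards".
# For one prompt with two sampled responses y and y', the scaled difference
# of log-probability ratios (new policy vs. the policy that generated the
# data) is regressed onto the observed reward difference.

def relative_reward_regression_loss(logp_new_y, logp_new_yp,
                                    logp_old_y, logp_old_yp,
                                    reward_y, reward_yp, eta=1.0):
    pred = (1.0 / eta) * ((logp_new_y - logp_old_y) - (logp_new_yp - logp_old_yp))
    target = reward_y - reward_yp
    return (pred - target) ** 2

# Example: y improved in log-probability and also carries the higher reward.
print(relative_reward_regression_loss(-1.0, -2.5, -1.2, -2.3, 1.0, 0.2))
```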

Continuous Control +5

Dataset Reset Policy Optimization for RLHF

1 code implementation • 12 Apr 2024 • Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun

Motivated by the fact that an offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution.
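
As a concrete, hedged illustration of the dataset-reset idea just described, the sketch below mixes rollout start states drawn from the offline dataset with start states from the initial state distribution; the mixing probability p_reset and the function names are illustrative assumptions, not details from the paper.

```python
import random

def sample_start_state(initial_state_sampler, offline_states, p_reset=0.5):
    """Return a start state for the next on-policy rollout: with probability
    p_reset, reset to a state seen in the offline preference dataset (e.g., a
    state on a labeler-preferred trajectory); otherwise use a standard reset."""
    if offline_states and random.random() < p_reset:
        return random.choice(offline_states)   # dataset reset
    return initial_state_sampler()             # reset to the initial state distribution

# Toy usage: initial distribution is a fixed state 0, offline states are 3, 7, 11.
starts = [sample_start_state(lambda: 0, offline_states=[3, 7, 11]) for _ in range(8)]
print(starts)
```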

Reinforcement Learning (RL)

Optimal Multi-Distribution Learning

no code implementations • 8 Dec 2023 • Zihan Zhang, Wenhao Zhan, Yuxin Chen, Simon S. Du, Jason D. Lee

Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension d, we propose a novel algorithm that yields an ε-optimal randomized hypothesis with a sample complexity on the order of (d+k)/ε² (modulo some logarithmic factor), matching the best-known lower bound.
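
Written out, the rate quoted above is the following bound on the number of samples n needed for an ε-optimal randomized hypothesis over k distributions (the tilde hides logarithmic factors):

```latex
% Sample complexity for k distributions and a hypothesis class of VC dimension d
n \;=\; \widetilde{O}\!\left(\frac{d + k}{\varepsilon^{2}}\right)
```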

Fairness

Provable Reward-Agnostic Preference-Based Reinforcement Learning

no code implementations • 29 May 2023 • Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals.
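
Pairwise trajectory feedback of this kind is commonly modeled with a Bradley-Terry style link between an (unobserved) reward and the preference probability; the formula below is that common model, stated here as an assumption rather than a detail quoted from the paper.

```latex
% Probability that trajectory \tau^1 is preferred over \tau^2 under a
% Bradley-Terry style model with cumulative reward r(\tau) = \sum_h r(s_h, a_h)
P\big(\tau^{1} \succ \tau^{2}\big)
  \;=\; \frac{\exp\!\big(r(\tau^{1})\big)}
             {\exp\!\big(r(\tau^{1})\big) + \exp\!\big(r(\tau^{2})\big)}
```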

Reinforcement Learning

Provable Offline Preference-Based Reinforcement Learning

no code implementations • 24 May 2023 • Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE.
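
A minimal sketch of those two steps, assuming a linear reward model and a finite set of candidate policies summarized by expected feature vectors; the feature construction, step sizes, and confidence radius below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def mle_reward(phi_pref, phi_rej, n_steps=500, lr=0.1):
    """Step (1): maximum-likelihood estimate of a linear reward r(tau) = <theta, phi(tau)>
    from pairwise preferences under a Bradley-Terry style model (gradient ascent)."""
    theta = np.zeros(phi_pref.shape[1])
    for _ in range(n_steps):
        diff = (phi_pref - phi_rej) @ theta                      # r(preferred) - r(rejected)
        sigma = 1.0 / (1.0 + np.exp(-diff))
        theta += lr * ((1.0 - sigma)[:, None] * (phi_pref - phi_rej)).mean(axis=0)
    return theta

def pessimistic_value(theta_hat, policy_features, radius):
    """Step (2), schematically: worst-case value over an L2 ball around the MLE
    parameters, a simple stand-in for planning against a confidence set."""
    return policy_features @ theta_hat - radius * np.linalg.norm(policy_features)

# Toy run: synthetic preference data, then pick the candidate policy with the
# best pessimistic value.
rng = np.random.default_rng(0)
phi_pref = rng.normal(size=(200, 4)) + 0.5
phi_rej = rng.normal(size=(200, 4))
theta_hat = mle_reward(phi_pref, phi_rej)
candidates = {"pi_a": np.array([1.0, 0.0, 0.0, 0.0]), "pi_b": np.array([0.0, 1.0, 0.0, 0.0])}
best = max(candidates, key=lambda name: pessimistic_value(theta_hat, candidates[name], radius=0.3))
print(best, theta_hat.round(2))
```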

Reinforcement Learning

PAC Reinforcement Learning for Predictive State Representations

no code implementations • 12 Jul 2022 • Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

We show that given a realizable model class, the sample complexity of learning a near-optimal policy only scales polynomially with respect to the statistical complexity of the model class, without any explicit polynomial dependence on the size of the state and observation spaces.

Reinforcement Learning +1

Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games

no code implementations • 3 Jun 2022 • Wenhao Zhan, Jason D. Lee, Zhuoran Yang

We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.

Decision Making

Offline Reinforcement Learning with Realizability and Single-policy Concentrability

no code implementations • 9 Feb 2022 • Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, Jason D. Lee

Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability).
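
For context, the coverage conditions mentioned above are usually stated through a concentrability coefficient; the definition below is the standard one (the notation is an assumption here, not quoted from the paper).

```latex
% Concentrability of a policy \pi with respect to the offline data distribution \mu,
% where d^{\pi} is the (discounted) state-action occupancy of \pi:
C^{\pi} \;=\; \sup_{s,a} \frac{d^{\pi}(s,a)}{\mu(s,a)}
% All-policy concentrability asks for \sup_{\pi} C^{\pi} < \infty; single-policy
% concentrability only asks for C^{\pi^{*}} < \infty for the comparator policy \pi^{*}.
```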

Offline RL, Reinforcement Learning +2

Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence

no code implementations • 24 May 2021 • Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, Yuejie Chi

These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer.
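
As one hedged example of such a structure-promoting regularizer, negative entropy with strength τ augments the discounted value function as follows (a standard special case; the notation is an assumption here, not quoted from the paper):

```latex
V_{\tau}^{\pi}(s)
  \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}
        \big( r(s_t, a_t) - \tau \log \pi(a_t \mid s_t) \big) \,\middle|\, s_0 = s \right]
```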

Reinforcement Learning (RL)
