Search Results for author: Tengyu Xu

Found 22 papers, 2 papers with code

The Perfect Blend: Redefining RLHF with Mixture of Judges

no code implementations30 Sep 2024 Tengyu Xu, Eryk Helenowski, Karthik Abinav Sankararaman, Di Jin, Kaiyan Peng, Eric Han, Shaoliang Nie, Chen Zhu, Hejia Zhang, Wenxuan Zhou, Zhouhao Zeng, Yun He, Karishma Mandyam, Arya Talabzadeh, Madian Khabsa, Gabriel Cohen, Yuandong Tian, Hao Ma, Sinong Wang, Han Fang

However, RLHF has limitations in multi-task learning (MTL) due to the challenges of reward hacking and extreme multi-objective optimization (i.e., trade-offs among multiple and/or sometimes conflicting objectives).

Instruction Following Math +1

Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward

no code implementations13 Jun 2022 Tengyu Xu, Yue Wang, Shaofeng Zou, Yingbin Liang

The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair.

Offline RL reinforcement-learning +2

Model-Based Offline Meta-Reinforcement Learning with Regularization

no code implementations ICLR 2022 Sen Lin, Jialin Wan, Tengyu Xu, Yingbin Liang, Junshan Zhang

In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using conservative policy evaluation and regularized policy improvement. The intrinsic tradeoff therein is achieved by striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy.

Meta Reinforcement Learning reinforcement-learning +3
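
A toy, single-state illustration of the "regularized policy improvement with two regularizers" idea described in the snippet above. The quadratic-free softmax closed form, the KL penalties, the weights lam_b / lam_m, and the Q-values are illustrative assumptions, not the exact MerPO/RAC objective.

```python
# Sketch: improve a discrete-action policy against conservative Q estimates while
# penalizing divergence from both a behavior policy and a meta-policy.
import numpy as np

def regularized_improvement(q, behavior_pi, meta_pi, lam_b=1.0, lam_m=1.0):
    """Maximize  E_pi[Q] - lam_b*KL(pi || behavior_pi) - lam_m*KL(pi || meta_pi)
    over a discrete action distribution; this objective has a softmax closed form."""
    lam = lam_b + lam_m
    logits = (q + lam_b * np.log(behavior_pi) + lam_m * np.log(meta_pi)) / lam
    logits -= logits.max()                      # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

q = np.array([1.0, 0.5, -0.2])                  # conservative Q estimates (assumed)
behavior = np.array([0.6, 0.3, 0.1])            # behavior policy from the offline data
meta = np.array([0.2, 0.5, 0.3])                # meta-policy shared across tasks
print(regularized_improvement(q, behavior, meta))
# Larger lam_b keeps the update close to the data; larger lam_m pulls it toward
# knowledge shared across tasks -- the tradeoff mentioned in the snippet.
```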

Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process

no code implementations20 Oct 2021 Tianjiao Li, Ziwei Guan, Shaofeng Zou, Tengyu Xu, Yingbin Liang, Guanghui Lan

Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge to the global optimum with a complexity of $\tilde{\mathcal O}(1/\epsilon)$ in terms of the optimality gap and the constraint violation, which improves the complexity of the existing primal-dual approach by a factor of $\mathcal O(1/\epsilon)$ (Ding et al., 2020; Paternain et al., 2019).

PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method

no code implementations ICLR 2022 Ziwei Guan, Tengyu Xu, Yingbin Liang

Although ETD has been shown to converge asymptotically to a desirable value function, it is well known that ETD often suffers from large variance, so that its sample complexity can grow exponentially fast with the number of iterations.

A Unified Off-Policy Evaluation Approach for General Value Function

no code implementations6 Jul 2021 Tengyu Xu, Zhuoran Yang, Zhaoran Wang, Yingbin Liang

We further show that, unlike GTD, the GVFs learned by GenTD are guaranteed to converge to the ground-truth GVFs as long as the function approximation power is sufficiently large.

Anomaly Detection Off-policy evaluation

Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality

no code implementations23 Feb 2021 Tengyu Xu, Zhuoran Yang, Zhaoran Wang, Yingbin Liang

We also show that the overall convergence of DR-Off-PAC is doubly robust to the approximation errors that depend only on the expressive power of approximation functions.

Proximal Gradient Descent-Ascent: Variable Convergence under KŁ Geometry

no code implementations ICLR 2021 Ziyi Chen, Yi Zhou, Tengyu Xu, Yingbin Liang

By leveraging this Lyapunov function and the KŁ geometry that parameterizes the local geometries of general nonconvex functions, we formally establish the variable convergence of proximal-GDA to a critical point $x^*$, i.e., $x_t\to x^*, y_t\to y^*(x^*)$.
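
A minimal sketch of a proximal gradient descent-ascent loop. The toy objective (linear-strongly-concave rather than the general nonconvex setting the paper studies), the $\ell_1$ regularizer, and the step sizes are assumptions chosen only to make the update pattern concrete.

```python
# Toy proximal-GDA: min_x max_y f(x, y) + g(x), where g(x) = lam*||x||_1 is the
# nonsmooth term handled by a proximal (soft-thresholding) step on x, while y is
# updated by plain gradient ascent.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def f_grad(x, y, A):
    # f(x, y) = y^T A x - 0.5*||y||^2  (strongly concave in y; a simplified toy)
    return A.T @ y, A @ x - y            # (grad_x f, grad_y f)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
x, y = rng.standard_normal(5), rng.standard_normal(5)
alpha, beta, lam = 0.05, 0.1, 0.1        # primal step, dual step, l1 weight

for _ in range(2000):
    gx, _ = f_grad(x, y, A)
    x = soft_threshold(x - alpha * gx, alpha * lam)   # proximal descent step on x
    _, gy = f_grad(x, y, A)
    y = y + beta * gy                                  # gradient ascent step on y
print(x, y)
```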

CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

no code implementations11 Nov 2020 Tengyu Xu, Yingbin Liang, Guanghui Lan

To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction.

reinforcement-learning Reinforcement Learning (RL) +1
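
A toy, single-state (bandit) sketch of the CRPO-style update rule described in the snippet: take a policy-gradient step on the reward when the cost constraint looks satisfied, otherwise take a step that decreases the violated cost. The paper uses natural policy gradient on full MDPs; the bandit setting, plain gradient steps, step size, and tolerance eta below are simplifying assumptions.

```python
import numpy as np

reward = np.array([1.0, 0.8, 0.1])     # expected reward per action (assumed)
cost   = np.array([0.9, 0.2, 0.0])     # expected cost per action (assumed)
limit, eta, lr = 0.3, 0.05, 0.5        # cost budget, tolerance, step size

theta = np.zeros(3)                    # softmax policy parameters
for _ in range(500):
    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    # policy gradient of E_pi[v] for a softmax bandit: pi * (v - E_pi[v])
    pg = lambda v: pi * (v - pi @ v)
    if pi @ cost <= limit + eta:
        theta += lr * pg(reward)       # constraint satisfied: improve reward
    else:
        theta -= lr * pg(cost)         # constraint violated: reduce that cost
pi = np.exp(theta - theta.max()); pi /= pi.sum()
print("policy:", pi, "reward:", pi @ reward, "cost:", pi @ cost)
```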

Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms

no code implementations10 Nov 2020 Tengyu Xu, Yingbin Liang

For linear TDC, we provide a novel non-asymptotic analysis and show that it attains an $\epsilon$-accurate solution with the optimal sample complexity of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$ under a constant stepsize.

reinforcement-learning Reinforcement Learning (RL) +1
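
A minimal sketch of the two-timescale linear TDC update evaluated on a toy Markov chain. The chain, rewards, features, and (constant) step sizes are illustrative assumptions; the paper analyzes the general algorithm and its sample complexity.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])             # transition matrix of a 2-state chain
r = np.array([1.0, -1.0])              # reward received on leaving each state
phi = np.array([[1.0, 0.0],
                [0.5, 1.0]])           # feature vector per state (rows)
gamma, alpha, beta = 0.9, 0.01, 0.05   # discount, slow/fast constant step sizes

theta = np.zeros(2)                    # value-function weights (slow timescale)
w = np.zeros(2)                        # correction weights (fast timescale)
s = 0
for _ in range(50_000):
    s_next = rng.choice(2, p=P[s])
    delta = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * (delta * phi[s] - gamma * (phi[s] @ w) * phi[s_next])
    w += beta * (delta - phi[s] @ w) * phi[s]
    s = s_next
print("learned weights:", theta)
```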

Enhanced First and Zeroth Order Variance Reduced Algorithms for Min-Max Optimization

no code implementations28 Sep 2020 Tengyu Xu, Zhe Wang, Yingbin Liang, H. Vincent Poor

Specifically, a novel variance reduction algorithm, SREDA, was recently proposed by Luo et al. (2020) to solve such a problem, and was shown to achieve the optimal complexity dependence on the required accuracy level $\epsilon$.

A Primal Approach to Constrained Policy Optimization: Global Optimality and Finite-Time Analysis

no code implementations28 Sep 2020 Tengyu Xu, Yingbin Liang, Guanghui Lan

To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction.

Safe Reinforcement Learning

When Will Generative Adversarial Imitation Learning Algorithms Attain Global Convergence

no code implementations24 Jun 2020 Ziwei Guan, Tengyu Xu, Yingbin Liang

Generative adversarial imitation learning (GAIL) is a popular inverse reinforcement learning approach for jointly optimizing policy and reward from expert trajectories.

Imitation Learning

Gradient Free Minimax Optimization: Variance Reduction and Faster Convergence

no code implementations16 Jun 2020 Tengyu Xu, Zhe Wang, Yingbin Liang, H. Vincent Poor

In this paper, we focus on such a gradient-free setting, and consider the nonconvex-strongly-concave minimax stochastic optimization problem.

Stochastic Optimization
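
The snippet above describes a gradient-free (zeroth-order) minimax setting. Below is a minimal sketch of the two-point random-direction gradient estimator that such methods build on, plugged into a plain GDA loop on a toy nonconvex-strongly-concave problem; the smoothing radius, the number of sampled directions, the toy objective, and the step sizes are all assumptions for illustration.

```python
import numpy as np

def zo_grad(fun, x, mu=1e-3, num_dirs=20, rng=None):
    """Estimate grad fun(x) using only function evaluations (two-point estimator)."""
    rng = rng or np.random.default_rng()
    g = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        g += (fun(x + mu * u) - fun(x - mu * u)) / (2 * mu) * u
    return g / num_dirs

def f(x, y):
    # toy nonconvex (in x) and strongly concave (in y) objective
    return np.sin(x[0]) + x @ y - 0.5 * y @ y

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)
for _ in range(3000):
    x -= 0.02 * zo_grad(lambda v: f(v, y), x, rng=rng)   # descent on x
    y += 0.05 * zo_grad(lambda v: f(x, v), y, rng=rng)   # ascent on y
print(x, y)
```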

Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms

no code implementations7 May 2020 Tengyu Xu, Zhe Wang, Yingbin Liang

In the first, nested-loop design, one actor update of the policy is followed by an entire loop of critic updates of the value function, and the finite-sample analysis of such AC and NAC algorithms has recently been well established.
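
A structural sketch of that nested-loop design: each actor update is preceded by an entire inner loop of critic (TD) updates under the current policy. The tiny tabular MDP, step sizes, and loop lengths are illustrative assumptions; the paper's contribution is the analysis of two time-scale variants rather than this nested-loop scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.2, 0.8]],       # P[s, a] = next-state distribution
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                     # R[s, a] = expected reward
              [0.0, 1.0]])

theta = np.zeros((nS, nA))                    # softmax policy parameters

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for outer in range(200):
    V = np.zeros(nS)                          # critic re-estimated each outer step
    for _ in range(500):                      # inner critic loop: TD(0) updates
        a = rng.choice(nA, p=policy(s))
        s_next = rng.choice(nS, p=P[s, a])
        V[s] += 0.1 * (R[s, a] + gamma * V[s_next] - V[s])
        s = s_next
    # one actor update from a single fresh transition (advantage = TD error)
    a = rng.choice(nA, p=policy(s))
    s_next = rng.choice(nS, p=P[s, a])
    delta = R[s, a] + gamma * V[s_next] - V[s]
    grad_log = -policy(s); grad_log[a] += 1.0  # grad of log softmax w.r.t. theta[s]
    theta[s] += 0.05 * delta * grad_log
    s = s_next
print(theta)
```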

Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

no code implementations NeurIPS 2020 Tengyu Xu, Zhe Wang, Yingbin Liang

We show that the overall sample complexity for a mini-batch AC to attain an $\epsilon$-accurate stationary point improves the best known sample complexity of AC by an order of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$, and the overall sample complexity for a mini-batch NAC to attain an $\epsilon$-accurate globally optimal point improves the existing sample complexity of NAC by an order of $\mathcal{O}(\epsilon^{-1}/\log(1/\epsilon))$.

Reinforcement Learning

Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling

no code implementations15 Feb 2020 Huaqing Xiong, Tengyu Xu, Yingbin Liang, Wei Zhang

Despite the wide applications of Adam in reinforcement learning (RL), the theoretical convergence of Adam-type RL algorithms has not been established.

reinforcement-learning Reinforcement Learning +1

Reanalysis of Variance Reduced Temporal Difference Learning

no code implementations ICLR 2020 Tengyu Xu, Zhe Wang, Yi Zhou, Yingbin Liang

Furthermore, the variance error (for both i.i.d. and Markovian sampling) and the bias error (for Markovian sampling) of VRTD are significantly reduced by the batch size of variance reduction in comparison to those of vanilla TD.

Reinforcement Learning
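
A minimal variance-reduced (SVRG-style) TD sketch matching the description above: a batch-averaged reference update is recomputed each epoch and used to correct per-sample TD updates. The toy chain, features, batch size, and step size are illustrative assumptions rather than the exact VRTD configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.7, 0.3], [0.4, 0.6]])       # 2-state Markov chain
r = np.array([1.0, -1.0])
phi = np.array([[1.0, 0.0], [0.5, 1.0]])     # per-state features
gamma, alpha, M = 0.9, 0.1, 200              # discount, step size, batch size

# Collect a buffer of transitions (s, r, s') from the chain.
samples, s = [], 0
for _ in range(5 * M):
    s_next = rng.choice(2, p=P[s])
    samples.append((s, r[s], s_next))
    s = s_next

def td_update(theta, sample):
    s, rew, s_next = sample
    delta = rew + gamma * phi[s_next] @ theta - phi[s] @ theta
    return delta * phi[s]

theta = np.zeros(2)
for epoch in range(50):
    theta_ref = theta.copy()
    batch = [samples[i] for i in rng.choice(len(samples), size=M)]
    g_ref = np.mean([td_update(theta_ref, t) for t in batch], axis=0)
    for _ in range(M):
        t = batch[rng.integers(M)]
        # variance-reduced step: g_i(theta) - g_i(theta_ref) + batch average
        theta += alpha * (td_update(theta, t) - td_update(theta_ref, t) + g_ref)
print("learned weights:", theta)
```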

Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples

no code implementations NeurIPS 2019 Tengyu Xu, Shaofeng Zou, Yingbin Liang

Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios.

When Will Gradient Methods Converge to Max-margin Classifier under ReLU Models?

1 code implementation ICLR 2019 Tengyu Xu, Yi Zhou, Kaiyi Ji, Yingbin Liang

We study the implicit bias of gradient descent methods in solving a binary classification problem over a linearly separable dataset.

Binary Classification
