1 code implementation • 21 Oct 2024 • Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, Sinong Wang
To address this gap, we introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions.
no code implementations • 30 Sep 2024 • Tengyu Xu, Eryk Helenowski, Karthik Abinav Sankararaman, Di Jin, Kaiyan Peng, Eric Han, Shaoliang Nie, Chen Zhu, Hejia Zhang, Wenxuan Zhou, Zhouhao Zeng, Yun He, Karishma Mandyam, Arya Talabzadeh, Madian Khabsa, Gabriel Cohen, Yuandong Tian, Hao Ma, Sinong Wang, Han Fang
However, RLHF has limitations in multi-task learning (MTL) due to challenges of reward hacking and extreme multi-objective optimization (i.e., the trade-off among multiple and sometimes conflicting objectives).
no code implementations • 13 Jun 2022 • Tengyu Xu, Yue Wang, Shaofeng Zou, Yingbin Liang
The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair.
no code implementations • ICLR 2022 • Sen Lin, Jialin Wan, Tengyu Xu, Yingbin Liang, Junshan Zhang
In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using conservative policy evaluation and regularized policy improvement; the intrinsic trade-off is achieved by striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy.
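To make the two-regularizer trade-off concrete, the following is a minimal sketch (not the paper's implementation) of a doubly regularized policy-improvement step over a discrete action set, assuming KL regularizers toward the behavior policy and the meta-policy; the closed-form solution, the temperature `tau`, and the weight `lam` are illustrative assumptions.

```python
import numpy as np

def regularized_improvement(q, pi_behavior, pi_meta, lam=0.5, tau=1.0):
    """Closed-form improved policy over a discrete action set that maximizes
    E_pi[Q] - tau * (lam * KL(pi || pi_behavior) + (1 - lam) * KL(pi || pi_meta)).
    The weight `lam` trades off staying close to the behavior policy
    (conservatism) against staying close to the meta-policy (transfer)."""
    logits = q / tau + lam * np.log(pi_behavior) + (1 - lam) * np.log(pi_meta)
    pi = np.exp(logits - logits.max())
    return pi / pi.sum()

# toy usage with 4 actions
q = np.array([1.0, 0.2, -0.5, 0.8])
pi_b = np.array([0.4, 0.3, 0.2, 0.1])         # behavior policy from offline data
pi_meta = np.array([0.25, 0.25, 0.25, 0.25])  # meta-policy shared across tasks
print(regularized_improvement(q, pi_b, pi_meta, lam=0.7))
```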
no code implementations • 20 Oct 2021 • Tianjiao Li, Ziwei Guan, Shaofeng Zou, Tengyu Xu, Yingbin Liang, Guanghui Lan
Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge to the global optimum with a complexity of $\tilde{\mathcal O}(1/\epsilon)$ in terms of the optimality gap and the constraint violation, which improves the complexity of the existing primal-dual approach by a factor of $\mathcal O(1/\epsilon)$ (Ding et al., 2020; Paternain et al., 2019).
no code implementations • ICLR 2022 • Ziwei Guan, Tengyu Xu, Yingbin Liang
Although ETD has been shown to converge asymptotically to a desirable value function, it is well known that ETD often suffers from large variance, so that its sample complexity can grow exponentially fast with the number of iterations.
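As an illustration of where that variance comes from, below is a minimal numpy sketch of an ETD(0) update with linear features: the follow-on trace F accumulates products of importance ratios, which is the quantity whose variance can blow up. The toy transitions, features, interest weights, and step size are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 5, 0.9, 0.01
phi = np.eye(n_states)                 # one-hot features (illustrative)
theta = np.zeros(n_states)

F, rho_prev = 1.0, 1.0                 # follow-on trace, previous importance ratio
s = 0
for t in range(5000):
    s_next = rng.integers(n_states)              # toy uniform transitions
    r = float(s_next == n_states - 1)            # toy reward
    rho = rng.uniform(0.5, 1.5)                  # pi(a|s)/mu(a|s), illustrative
    F = gamma * rho_prev * F + 1.0               # follow-on trace (interest = 1)
    delta = r + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * F * rho * delta * phi[s]    # ETD(0) update, emphasis M_t = F_t
    rho_prev, s = rho, s_next
print(theta)
```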
no code implementations • 6 Jul 2021 • Tengyu Xu, Zhuoran Yang, Zhaoran Wang, Yingbin Liang
We further show that, unlike GTD, the GVFs learned by GenTD are guaranteed to converge to the ground-truth GVFs as long as the function approximation power is sufficiently large.
no code implementations • 23 Feb 2021 • Tengyu Xu, Zhuoran Yang, Zhaoran Wang, Yingbin Liang
We also show that the overall convergence of DR-Off-PAC is doubly robust to the approximation errors that depend only on the expressive power of approximation functions.
no code implementations • ICLR 2021 • Ziyi Chen, Yi Zhou, Tengyu Xu, Yingbin Liang
By leveraging this Lyapunov function and the Kurdyka-Łojasiewicz (KŁ) geometry that parameterizes the local geometries of general nonconvex functions, we formally establish the variable convergence of proximal-GDA to a critical point $x^*$, i.e., $x_t\to x^*, y_t\to y^*(x^*)$.
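For reference, here is a minimal sketch of the proximal-GDA iteration whose variable convergence the excerpt establishes: a proximal gradient step on x (with an L1 regularizer, so the prox is soft-thresholding) alternated with a gradient-ascent step on y. The toy objective, regularizer, and step sizes are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_x(x, y):
    # gradient in x of the smooth part cos(sum(x)) + x @ y - 0.5 * y @ y
    return -np.sin(x.sum()) * np.ones_like(x) + y

def grad_y(x, y):
    # gradient in y of the same smooth part
    return x - y

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x, y = rng.normal(size=3), rng.normal(size=3)
eta, beta, lam = 0.05, 0.1, 0.01
for t in range(3000):
    x = soft_threshold(x - eta * grad_x(x, y), eta * lam)  # proximal gradient-descent step
    y = y + beta * grad_y(x, y)                            # gradient-ascent step
print(x, y)
```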
no code implementations • 11 Nov 2020 • Tengyu Xu, Yingbin Liang, Guanghui Lan
To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction.
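A minimal sketch of the primal switching rule CRPO uses: at each step, the estimated constraint values are checked, and the policy takes a gradient step on the reward objective if all constraints are within tolerance, or on the first violated constraint otherwise. A plain gradient step stands in for the natural policy gradient, and all estimates below are placeholders, not the paper's implementation.

```python
import numpy as np

def crpo_step(theta, grad_reward, constraints, eta=0.1, tol=0.05):
    """One CRPO-style update. `constraints` is a list of
    (estimated_value, limit, gradient) triples; `grad_reward` is the
    estimated policy gradient of the reward objective."""
    for value, limit, grad_c in constraints:
        if value > limit + tol:                  # constraint violated beyond tolerance
            return theta - eta * grad_c          # descend that constraint instead
    return theta + eta * grad_reward             # otherwise ascend the reward

# toy usage with random placeholder estimates
rng = np.random.default_rng(1)
theta = np.zeros(3)
theta = crpo_step(theta,
                  grad_reward=rng.normal(size=3),
                  constraints=[(0.2, 0.1, rng.normal(size=3))])
print(theta)
```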
no code implementations • 10 Nov 2020 • Tengyu Xu, Yingbin Liang
For linear TDC, we provide a novel non-asymptotic analysis and show that it attains an $\epsilon$-accurate solution with the optimal sample complexity of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$ under a constant stepsize.
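Below is a minimal numpy sketch of the two-timescale linear TDC update analyzed in the excerpt: the main iterate theta follows the corrected TD direction while the auxiliary iterate w tracks a projected TD error. The features, rewards, importance ratios, and the particular constant stepsizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma, alpha, beta = 4, 0.95, 0.02, 0.05   # constant stepsizes
theta, w = np.zeros(d), np.zeros(d)

for t in range(20000):
    phi = rng.normal(size=d)                  # feature of current state (toy)
    phi_next = rng.normal(size=d)             # feature of next state (toy)
    r = phi.sum() * 0.1                       # toy reward
    rho = rng.uniform(0.5, 1.5)               # importance ratio pi/mu (toy)
    delta = r + gamma * phi_next @ theta - phi @ theta
    theta += alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)  # corrected TD step
    w += beta * rho * (delta - w @ phi) * phi                            # auxiliary iterate
print(theta)
```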
no code implementations • 28 Sep 2020 • Tengyu Xu, Zhe Wang, Yingbin Liang, H. Vincent Poor
Specifically, a novel variance reduction algorithm, SREDA, was recently proposed by Luo et al. (2020) to solve such a problem, and was shown to achieve the optimal complexity dependence on the required accuracy level $\epsilon$.
no code implementations • 28 Sep 2020 • Tengyu Xu, Yingbin Liang, Guanghui Lan
To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction.
no code implementations • 24 Jun 2020 • Ziwei Guan, Tengyu Xu, Yingbin Liang
Generative adversarial imitation learning (GAIL) is a popular inverse reinforcement learning approach for jointly optimizing policy and reward from expert trajectories.
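A minimal sketch of the alternating structure GAIL uses: a discriminator is trained to separate expert from policy state-action pairs, and its output is turned into a surrogate reward for the policy update. The logistic-regression discriminator, the stand-in policy-gradient step, and the toy data below are assumptions used only to show the loop, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w_disc = np.zeros(d)        # discriminator parameters (logistic regression)
theta = np.zeros(d)         # policy parameters (placeholder)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

expert_sa = rng.normal(loc=1.0, size=(128, d))   # expert state-action features (toy)
for it in range(200):
    policy_sa = rng.normal(loc=theta.mean(), size=(128, d))  # policy rollouts (toy)
    # discriminator step: label expert 1, policy 0 (logistic-regression ascent)
    X = np.vstack([expert_sa, policy_sa])
    labels = np.concatenate([np.ones(128), np.zeros(128)])
    preds = sigmoid(X @ w_disc)
    w_disc += 0.05 * X.T @ (labels - preds) / len(labels)
    # policy step: surrogate reward -log(1 - D(s, a)), folded into a toy update
    reward = -np.log(1.0 - sigmoid(policy_sa @ w_disc) + 1e-8)
    theta += 0.01 * (policy_sa * reward[:, None]).mean(axis=0)  # stand-in for a PG step
print(theta)
```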
no code implementations • 16 Jun 2020 • Tengyu Xu, Zhe Wang, Yingbin Liang, H. Vincent Poor
In this paper, we focus on such a gradient-free setting, and consider the nonconvex-strongly-concave minimax stochastic optimization problem.
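To illustrate the gradient-free setting, here is a minimal sketch in which gradients of f(x, y) are replaced by two-point Gaussian-smoothing estimates built purely from function evaluations, plugged into alternating descent on x and ascent on y. The test function, smoothing radius, and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, y):
    # toy nonconvex-strongly-concave objective (illustrative)
    return np.sin(x @ x) + x @ y - 0.5 * y @ y

def zo_grad(fun, z, mu=1e-3):
    """Two-point Gaussian-smoothing gradient estimate from function values only."""
    u = rng.normal(size=z.shape)
    return (fun(z + mu * u) - fun(z)) / mu * u

x, y = rng.normal(size=2), rng.normal(size=2)
eta_x, eta_y = 0.01, 0.05
for t in range(2000):
    gx = zo_grad(lambda xx: f(xx, y), x)   # zeroth-order estimate of grad_x f
    gy = zo_grad(lambda yy: f(x, yy), y)   # zeroth-order estimate of grad_y f
    x -= eta_x * gx                        # descent on x
    y += eta_y * gy                        # ascent on y
print(x, y)
```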
no code implementations • 7 May 2020 • Tengyu Xu, Zhe Wang, Yingbin Liang
In the first nested-loop design, one policy update by the actor is followed by an entire loop of critic updates of the value function, and the finite-sample analysis of such AC and NAC algorithms has recently been well established.
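A minimal sketch of that nested-loop structure: each actor (policy) update is followed by a full inner loop of critic TD updates on the value function. Every component here (the transition sampler, features, rewards, and step sizes) is a placeholder used only to make the control flow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
d_policy, d_value = 3, 3
theta = np.zeros(d_policy)     # actor (policy) parameters
w = np.zeros(d_value)          # critic (value-function) parameters
gamma, alpha, beta = 0.95, 0.01, 0.05

def sample_transition():
    """Placeholder sampler: returns (phi_s, phi_s_next, reward, score)."""
    phi_s, phi_next = rng.normal(size=d_value), rng.normal(size=d_value)
    reward = phi_s.sum() * 0.1
    score = rng.normal(size=d_policy)      # stand-in for grad log pi(a|s)
    return phi_s, phi_next, reward, score

for outer in range(100):                   # actor loop
    for inner in range(50):                # entire critic loop per actor update
        phi_s, phi_next, r, _ = sample_transition()
        delta = r + gamma * phi_next @ w - phi_s @ w
        w += beta * delta * phi_s          # TD(0) critic update
    phi_s, phi_next, r, score = sample_transition()
    advantage = r + gamma * phi_next @ w - phi_s @ w
    theta += alpha * advantage * score     # one policy-gradient actor update
print(theta)
```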
no code implementations • NeurIPS 2020 • Tengyu Xu, Zhe Wang, Yingbin Liang
We show that the overall sample complexity for a mini-batch AC to attain an $\epsilon$-accurate stationary point improves the best known sample complexity of AC by an order of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$, and the overall sample complexity for a mini-batch NAC to attain an $\epsilon$-accurate globally optimal point improves the existing sample complexity of NAC by an order of $\mathcal{O}(\epsilon^{-1}/\log(1/\epsilon))$.
no code implementations • 15 Feb 2020 • Huaqing Xiong, Tengyu Xu, Yingbin Liang, Wei Zhang
Despite the wide applications of Adam in reinforcement learning (RL), the theoretical convergence of Adam-type RL algorithms has not been established.
no code implementations • ICLR 2020 • Tengyu Xu, Zhe Wang, Yi Zhou, Yingbin Liang
Furthermore, the variance error (for both i.i.d. and Markovian sampling) and the bias error (for Markovian sampling) of VRTD are significantly reduced by the batch size of variance reduction in comparison to those of vanilla TD.
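A minimal sketch of the variance-reduction idea behind VRTD: within each epoch a batch-averaged TD pseudo-gradient at a reference iterate is computed once, and each inner update combines a fresh sample with the stored reference correction, so the update variance scales down with the batch size mentioned above. The toy data generation and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma, alpha, M = 4, 0.9, 0.05, 256
theta = np.zeros(d)

def td_pseudo_grad(theta, phi, phi_next, r):
    """TD(0) pseudo-gradient for one transition with linear features."""
    return (r + gamma * phi_next @ theta - phi @ theta) * phi

for epoch in range(50):
    # draw a batch of M transitions (toy i.i.d. data)
    phis = rng.normal(size=(M, d))
    phis_next = rng.normal(size=(M, d))
    rs = phis.sum(axis=1) * 0.1
    theta_ref = theta.copy()
    # batch-averaged pseudo-gradient at the reference iterate
    g_ref = np.mean([td_pseudo_grad(theta_ref, phis[i], phis_next[i], rs[i])
                     for i in range(M)], axis=0)
    for _ in range(M):
        j = rng.integers(M)
        correction = (td_pseudo_grad(theta, phis[j], phis_next[j], rs[j])
                      - td_pseudo_grad(theta_ref, phis[j], phis_next[j], rs[j]))
        theta += alpha * (correction + g_ref)   # variance-reduced update
print(theta)
```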
no code implementations • NeurIPS 2019 • Tengyu Xu, Shaofeng Zou, Yingbin Liang
Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios.
no code implementations • NeurIPS 2019 • Shaofeng Zou, Tengyu Xu, Yingbin Liang
We also provide a finite-sample analysis of this fitted SARSA algorithm.
1 code implementation • ICLR 2019 • Tengyu Xu, Yi Zhou, Kaiyi Ji, Yingbin Liang
We study the implicit bias of gradient descent methods in solving a binary classification problem over a linearly separable dataset.
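A minimal numpy sketch of this setting: gradient descent on the logistic (exponential-tailed) loss over a linearly separable dataset, where the norm of the iterate grows without bound while its direction stabilizes toward the max-margin direction. The toy data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# linearly separable toy data: labels determined by a ground-truth direction
n, d = 200, 2
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, 2.0]))

w, lr = np.zeros(d), 0.1
for t in range(20000):
    margins = y * (X @ w)
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n  # gradient of the logistic loss
    w -= lr * grad

# ||w|| keeps growing, but the direction w / ||w|| converges
print(w / np.linalg.norm(w))
```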