no code implementations • 14 Feb 2024 • Chenlu Ye, Jiafan He, Quanquan Gu, Tong Zhang
We also prove a lower bound to show that the additive dependence on $C$ is optimal.
Model-based Reinforcement Learning, Reinforcement Learning (+1)
no code implementations • 11 Feb 2024 • Chenlu Ye, Wei Xiong, Yuheng Zhang, Nan Jiang, Tong Zhang
We study Reinforcement Learning from Human Feedback (RLHF) under a general preference oracle.
no code implementations • 18 Dec 2023 • Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang
This includes an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios.
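To make the online objective concrete, here is a minimal sketch of the per-batch DPO loss that the iterative variant re-minimizes on freshly labeled preference pairs (PyTorch, with illustrative function and argument names; not the authors' code):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities of the
    chosen/rejected responses under the trainable policy or the frozen
    reference model; beta sets the strength of the implicit KL penalty.
    """
    # Implicit rewards: scaled log-ratios of policy to reference model.
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry negative log-likelihood of the observed preference.
    return -F.logsigmoid(chosen - rejected).mean()
```

In the iterative version, each round samples responses from the current policy, queries the preference signal, and minimizes this objective on the new pairs, rather than training once on a fixed offline dataset.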
no code implementations • 22 Nov 2023 • Jianqing Fan, Zhaoran Wang, Zhuoran Yang, Chenlu Ye
For these settings, we design a provably sample-efficient algorithm which achieves a $\tilde{\mathcal{O}}(s_0^2 \log^2 T)$ regret in the sparse case and a $\tilde{\mathcal{O}}(r^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}(\log T)$ batches.
1 code implementation • NeurIPS 2023 • Chenlu Ye, Rui Yang, Quanquan Gu, Tong Zhang
Notably, under the assumption of single policy coverage and the knowledge of $\zeta$, our proposed algorithm achieves a suboptimality bound that is worsened by an additive factor of $\mathcal{O}(\zeta (C(\widehat{\mathcal{F}},\mu)n)^{-1})$ due to the corruption.
no code implementations • 5 Sep 2023 • Yong Lin, Chen Liu, Chenlu Ye, Qing Lian, Yuan Yao, Tong Zhang
Our proposed method, COPS (unCertainty based OPtimal Sub-sampling), is designed to minimize the expected loss of a model trained on subsampled data.
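A minimal sketch of that idea, assuming per-example uncertainty scores have already been computed (NumPy; the function name and the inverse-probability reweighting rule are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def uncertainty_subsample(uncertainty, budget, seed=None):
    """Draw `budget` indices with probability proportional to uncertainty.

    Returns indices plus importance weights so that the weighted sum of
    per-example losses is an unbiased estimate of the full-data loss.
    """
    rng = np.random.default_rng(seed)
    probs = uncertainty / uncertainty.sum()
    idx = rng.choice(len(uncertainty), size=budget, replace=True, p=probs)
    # Inverse-probability (Horvitz-Thompson-style) correction for the
    # non-uniform sampling.
    weights = 1.0 / (budget * probs[idx])
    return idx, weights
```

The weights matter: training on the raw subsample would bias the model toward high-uncertainty regions, while the inverse-probability correction keeps the subsampled loss centered on the full-data loss.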
no code implementations • 12 Dec 2022 • Chenlu Ye, Wei Xiong, Quanquan Gu, Tong Zhang
In this paper, we consider the contextual bandit with general function approximation and propose a computationally efficient algorithm that achieves a regret of $\tilde{\mathcal{O}}(\sqrt{T}+\zeta)$, where $\zeta$ is the corruption level.
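The corruption robustness comes from down-weighting samples on which the learner is still uncertain, since those are where an adversary can do the most damage. A minimal linear-regression sketch of such uncertainty weighting (my own simplification with invented names; the paper handles general function approximation and uses its own weight rule):

```python
import numpy as np

def weighted_ridge(X, y, bonuses, alpha=1.0, lam=1.0):
    """Uncertainty-weighted ridge regression on past (context, reward) pairs.

    `bonuses[t]` is an uncertainty estimate for sample t; samples with
    large uncertainty receive weight below 1, so a corrupted reward there
    has only bounded influence on the fitted parameter.
    """
    w = np.minimum(1.0, alpha / bonuses)        # down-weight uncertain samples
    Xw = X * w[:, None]                         # row-wise reweighting
    A = Xw.T @ X + lam * np.eye(X.shape[1])     # weighted Gram matrix
    return np.linalg.solve(A, Xw.T @ y)         # weighted least-squares fit
```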