no code implementations • 14 Feb 2024 • Chenlu Ye, Jiafan He, Quanquan Gu, Tong Zhang
We also prove a lower bound to show that the additive dependence on $C$ is optimal.
1 code implementation • 11 Feb 2024 • Chenlu Ye, Wei Xiong, Yuheng Zhang, Nan Jiang, Tong Zhang
We study Reinforcement Learning from Human Feedback (RLHF) under a general preference oracle.
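As a rough illustration of the setting only (the interface and names below are hypothetical, not the paper's construction), a general preference oracle can be viewed as a black box that compares two responses to a prompt and need not be induced by any scalar reward model:

```python
import random
from typing import Protocol


class PreferenceOracle(Protocol):
    """Hypothetical interface for a general preference oracle: it only
    compares pairs of responses and need not come from an underlying
    scalar reward function."""

    def prefer_prob(self, prompt: str, response_a: str, response_b: str) -> float:
        """Return P(response_a is preferred to response_b | prompt)."""
        ...


def sample_label(oracle: PreferenceOracle, prompt: str, a: str, b: str) -> int:
    """Draw a binary preference label (1 if `a` wins) from the oracle."""
    return int(random.random() < oracle.prefer_prob(prompt, a, b))
```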
3 code implementations • 18 Dec 2023 • Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang
We investigate its behavior in three distinct settings -- offline, online, and hybrid -- and propose efficient algorithms with finite-sample theoretical guarantees.
no code implementations • 22 Nov 2023 • Jianqing Fan, Zhaoran Wang, Zhuoran Yang, Chenlu Ye
For these settings, we design a provably sample-efficient algorithm that achieves an $\tilde{\mathcal{O}}(s_0^2 \log^2 T)$ regret in the sparse case and an $\tilde{\mathcal{O}}(r^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}(\log T)$ batches.
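For intuition on the $L = \mathcal{O}(\log T)$ batch budget, a standard way to obtain logarithmically many batches is a geometric (doubling-style) grid of batch endpoints; the sketch below is a generic construction under that assumption, not necessarily the grid used in the paper.

```python
import math


def geometric_batch_grid(T: int, gamma: float = 2.0) -> list[int]:
    """Batch endpoints 0 < t_1 < ... < t_L = T growing geometrically,
    so the number of batches L is O(log T)."""
    endpoints, t = [], 1
    while t < T:
        endpoints.append(t)
        t = math.ceil(gamma * t)
    endpoints.append(T)
    return endpoints


# Example: T = 10_000 with gamma = 2 gives roughly log2(T) ~ 14 batches.
print(geometric_batch_grid(10_000))
```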
1 code implementation • NeurIPS 2023 • Chenlu Ye, Rui Yang, Quanquan Gu, Tong Zhang
Notably, under the assumption of single-policy coverage and the knowledge of $\zeta$, our proposed algorithm achieves a suboptimality bound that is worsened by an additive factor of $\mathcal{O}(\zeta (C(\widehat{\mathcal{F}},\mu)n)^{-1})$ due to the corruption.
no code implementations • 5 Sep 2023 • Yong Lin, Chen Liu, Chenlu Ye, Qing Lian, Yuan Yao, Tong Zhang
Our proposed method, COPS (unCertainty based OPtimal Sub-sampling), is designed to minimize the expected loss of a model trained on subsampled data.
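As a generic sketch of uncertainty-driven subsampling (not the COPS algorithm itself; the scoring rule and names are assumptions), one can keep each example with probability proportional to an uncertainty proxy and reweight the retained examples by inverse inclusion probability, so the subsampled loss remains an unbiased estimate of the full-data loss:

```python
import numpy as np


def uncertainty_subsample(uncertainty: np.ndarray, budget: int, rng=None):
    """Keep examples with probability proportional to `uncertainty`
    (e.g. ensemble variance) and return inverse-probability weights so
    the weighted subsample loss is unbiased for the full loss."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.minimum(1.0, budget * uncertainty / uncertainty.sum())
    keep = rng.random(len(uncertainty)) < probs
    idx = np.flatnonzero(keep)
    return idx, 1.0 / probs[idx]


# Usage: train on (X[idx], y[idx]) with per-example losses scaled by the weights.
```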
no code implementations • 12 Dec 2022 • Chenlu Ye, Wei Xiong, Quanquan Gu, Tong Zhang
In this paper, we consider the contextual bandit with general function approximation and propose a computationally efficient algorithm that achieves a regret of $\tilde{O}(\sqrt{T}+\zeta)$, where $\zeta$ is the corruption level.
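To make the interaction protocol concrete, here is a generic regression-oracle bandit loop under hypothetical environment interfaces (`env.context`, `env.reward`); it is not the corruption-robust, uncertainty-weighted algorithm of the paper, and a plain epsilon-greedy loop like this does not attain the $\tilde{O}(\sqrt{T}+\zeta)$ guarantee.

```python
import numpy as np
from sklearn.linear_model import Ridge


def bandit_loop(env, T: int, n_actions: int, eps: float = 0.05):
    """Generic contextual bandit with a regression oracle and
    epsilon-greedy exploration (illustration only)."""
    X_hist, r_hist = [], []
    model, rng = Ridge(alpha=1.0), np.random.default_rng(0)
    for t in range(T):
        feats = env.context(t)                        # (n_actions, dim) feature matrix
        if X_hist and rng.random() > eps:
            a = int(np.argmax(model.predict(feats)))  # exploit the fitted reward model
        else:
            a = int(rng.integers(n_actions))          # explore uniformly at random
        X_hist.append(feats[a])
        r_hist.append(env.reward(t, a))               # reward may be adversarially corrupted
        model.fit(np.array(X_hist), np.array(r_hist))
    return model
```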