no code implementations • 2 Dec 2023 • Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy
Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with human preferences because they rely on preference models that assume independence of irrelevant alternatives (IIA).
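A minimal sketch of the IIA property in the Bradley-Terry/Plackett-Luce choice model that standard RLHF reward learning builds on; the utility values are hypothetical, and this illustrates the assumption the paper critiques rather than the paper's proposed fix:

```python
import numpy as np

def choice_probs(utilities):
    """Plackett-Luce: P(choose i from a slate) is proportional to exp(u_i)."""
    u = np.asarray(utilities, dtype=float)
    e = np.exp(u - u.max())  # subtract max for numerical stability
    return e / e.sum()

u_a, u_b, u_c = 1.0, 0.5, 2.0  # hypothetical learned utilities

p2 = choice_probs([u_a, u_b])        # slate {a, b}
p3 = choice_probs([u_a, u_b, u_c])   # slate {a, b, c}

# Under IIA, adding option c leaves the a:b odds unchanged.
print(p2[0] / p2[1])   # exp(1.0 - 0.5) = 1.6487...
print(p3[0] / p3[1])   # same ratio: the "irrelevant" alternative has no effect
```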
1 code implementation • 4 Jun 2023 • Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao
Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) with human preferences.
no code implementations • 19 May 2023 • Wanqiao Xu, Shi Dong, Dilip Arumugam, Benjamin Van Roy
In this work, we adopt a novel perspective in which a pre-trained language model simultaneously serves as a policy, a reward function, and a transition function.
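A sketch of this framing as a token-level MDP, where the state is the token prefix and the action is the next token; the toy `lm_logits` function is a hypothetical stand-in for a real pre-trained model, and this particular reward choice is one possible instantiation:

```python
import math
import random

VOCAB = ["yes", "no", "<eos>"]

def lm_logits(prefix):
    # Hypothetical stand-in for a pre-trained LM's next-token scores;
    # here, "<eos>" becomes more likely as the prefix grows.
    return {t: (0.5 * len(prefix) if t == "<eos>" else 1.0) for t in VOCAB}

def softmax(logits):
    m = max(logits.values())
    e = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(e.values())
    return {t: v / z for t, v in e.items()}

def policy(state):
    # Policy: sample the next token from the LM's distribution at this state.
    p = softmax(lm_logits(state))
    return random.choices(list(p), weights=list(p.values()))[0]

def transition(state, action):
    # Transition function: deterministically append the chosen token.
    return state + (action,)

def reward(state, action):
    # Reward function: the LM's own log-probability of the action,
    # i.e. the model scoring its own behavior.
    return math.log(softmax(lm_logits(state))[action])

state = ()
for _ in range(10):  # cap rollout length for the demo
    a = policy(state)
    print(state, "->", a, f"(reward {reward(state, a):.2f})")
    state = transition(state, a)
    if a == "<eos>":
        break
```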
no code implementations • 24 Dec 2022 • Dilip Arumugam, Shi Dong, Benjamin Van Roy
Prevailing methods for assessing and comparing generative AIs incentivize responses that serve a hypothetical representative individual.
no code implementations • 29 Nov 2022 • Wanqiao Xu, Shi Dong, Benjamin Van Roy
We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments.
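A minimal sketch of the continuing-interface idea: vanilla PSRL resamples a model from the posterior at episode boundaries, and one natural continuing analogue resamples at random times instead. The `env`/`posterior` interfaces and the per-step resampling probability are assumptions for illustration, not the paper's exact scheme:

```python
import numpy as np

def continuing_psrl(env, posterior, steps=10_000, p_resample=0.01,
                    rng=np.random.default_rng(0)):
    """Posterior sampling without episode boundaries (sketch)."""
    model = posterior.sample(rng)   # draw an MDP from the posterior
    policy = model.solve()          # plan optimally for the sampled MDP
    state = env.reset()
    for _ in range(steps):
        action = policy(state)
        next_state, reward = env.step(action)
        posterior.update(state, action, reward, next_state)
        state = next_state
        # Resample at geometrically distributed times, mimicking
        # episode boundaries within a continuing interaction.
        if rng.random() < p_resample:
            model = posterior.sample(rng)
            policy = model.solve()
```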
no code implementations • 7 Jul 2022 • Zhifeng Wang, Wenxing Yan, Chunyan Zeng, Shi Dong
Intelligent learning diagnosis, a critical engine of intelligent tutoring systems, aims to estimate a learner's current knowledge mastery and to predict their future learning performance.
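To make "knowledge mastery estimation" concrete, here is classical Bayesian Knowledge Tracing, a much simpler baseline than the paper's approach and not its method; all parameter values are hypothetical:

```python
def bkt_update(p_mastery, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """One Bayesian Knowledge Tracing step: update mastery from an answer."""
    # Posterior over mastery given the observed answer (Bayes' rule).
    if correct:
        num = p_mastery * (1 - p_slip)
        den = num + (1 - p_mastery) * p_guess
    else:
        num = p_mastery * p_slip
        den = num + (1 - p_mastery) * (1 - p_guess)
    posterior = num / den
    # The learner may also acquire the skill on this step.
    return posterior + (1 - posterior) * p_learn

p = 0.3  # hypothetical prior mastery
for answer in [True, True, False, True]:
    p = bkt_update(p, answer)
print(f"estimated mastery: {p:.3f}")
```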
no code implementations • 10 Feb 2021 • Shi Dong, Benjamin Van Roy, Zhengyuan Zhou
The time required to approach asymptotic performance is polynomial in both the complexity of the agent's state representation and the time needed to evaluate the best policy the agent can represent.
no code implementations • 13 Dec 2019 • Shi Dong, Benjamin Van Roy, Zhengyuan Zhou
We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + \epsilon HK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $\epsilon$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state.
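A sketch of optimistic Q-learning over an aggregated state representation: Q-values are indexed by the aggregate state $\phi(s)$ rather than the raw state, with an optimism bonus that shrinks with visit counts. The `env`/`phi` interfaces, bonus form, and learning-rate schedule are illustrative assumptions in the style of optimistic Q-learning, not the paper's exact constants:

```python
import numpy as np

def aggregated_q_learning(env, phi, H, M, A, K, c=1.0):
    """Optimistic Q-learning with aggregate states (sketch)."""
    Q = np.full((H, M, A), float(H))      # optimistic initialization
    N = np.zeros((H, M, A), dtype=int)    # visit counts
    for _ in range(K):
        s = env.reset()
        for h in range(H):
            m = phi(s)                    # aggregate-state index in [0, M)
            a = int(np.argmax(Q[h, m]))   # greedy w.r.t. optimistic Q
            s_next, r = env.step(a)
            N[h, m, a] += 1
            n = N[h, m, a]
            alpha = (H + 1) / (H + n)     # step size common in this literature
            bonus = c * np.sqrt(H**3 / n) # optimism bonus (assumed form)
            v_next = Q[h + 1, phi(s_next)].max() if h + 1 < H else 0.0
            Q[h, m, a] = (1 - alpha) * Q[h, m, a] + alpha * min(H, r + v_next + bonus)
            s = s_next
    return Q
```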
no code implementations • 18 Nov 2019 • Benjamin Van Roy, Shi Dong
Du, Kakade, Wang, and Yang recently established intriguing lower bounds on sample complexity, which suggest that reinforcement learning with a misspecified representation is intractable.
no code implementations • 12 May 2019 • Shi Dong, Tengyu Ma, Benjamin Van Roy
Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$.
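A minimal sketch of Thompson sampling for the linear bandit with a Gaussian prior and Gaussian noise (conjugate updates); the noise level, prior, and random action set are assumptions. The last line mirrors the stated setting by drawing the unknown coefficient vector from the action set itself:

```python
import numpy as np

def linear_ts(actions, theta_star, T, sigma=0.1, rng=np.random.default_rng(0)):
    """Thompson sampling with a N(0, I) prior and Gaussian rewards."""
    d = actions.shape[1]
    P = np.eye(d)       # posterior precision matrix
    b = np.zeros(d)     # accumulates a_t * r_t / sigma^2
    regret, best = 0.0, (actions @ theta_star).max()
    for _ in range(T):
        cov = np.linalg.inv(P)
        theta = rng.multivariate_normal(cov @ b, cov)  # posterior sample
        a = actions[np.argmax(actions @ theta)]        # act greedily on it
        r = a @ theta_star + sigma * rng.standard_normal()
        P += np.outer(a, a) / sigma**2                 # conjugate update
        b += a * r / sigma**2
        regret += best - a @ theta_star
    return regret

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # unit-norm actions
print(linear_ts(A, theta_star=A[0], T=1000))   # theta* drawn from the action set
```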
no code implementations • NeurIPS 2018 • Shi Dong, Benjamin Van Roy
We also offer a bound for the logistic bandit that dramatically improves on the best bound previously available, though it depends on an information-theoretic statistic that we have so far been able to quantify only via computation.
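To make the logistic bandit setting concrete, here is a one-dimensional Thompson-sampling sketch with an exact grid posterior; the grid, prior, action set, and true parameter are illustrative assumptions, and this does not compute the paper's information-theoretic statistic:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
grid = np.linspace(-3.0, 3.0, 301)   # candidate values of theta
log_post = np.zeros_like(grid)       # uniform prior over the grid
actions = np.array([-1.0, -0.5, 0.5, 1.0])
theta_star = 1.5                     # hypothetical true parameter

for t in range(500):
    p = np.exp(log_post - log_post.max())
    theta = rng.choice(grid, p=p / p.sum())            # posterior sample
    a = actions[np.argmax(sigmoid(actions * theta))]   # greedy on the sample
    r = rng.random() < sigmoid(a * theta_star)         # Bernoulli reward
    lik = sigmoid(a * grid)                            # likelihood per theta
    log_post += np.log(lik if r else 1.0 - lik)        # Bayes update

print("posterior mode:", grid[np.argmax(log_post)])
```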