Search Results for author: Shi Dong

Found 11 papers, 1 paper with code

RLHF and IIA: Perverse Incentives

no code implementations • 2 Dec 2023 • Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy

Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA).

reinforcement-learning
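
For context on the IIA assumption referenced above, the following is the standard Plackett-Luce (Bradley-Terry) choice model and the independence property it implies; this is textbook background, not material quoted from the paper.

```latex
% Plackett--Luce choice model and the IIA property it implies.
\[
  P(i \mid S) \;=\; \frac{e^{r_i}}{\sum_{j \in S} e^{r_j}},
  \qquad
  \frac{P(i \mid S)}{P(k \mid S)} \;=\; e^{r_i - r_k}
  \quad \text{for all } i, k \in S,
\]
% i.e., the relative preference between two alternatives is unchanged by
% adding or removing other alternatives from the choice set.
```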

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

1 code implementation • 4 Jun 2023 • Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao

Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences.

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

no code implementations • 19 May 2023 • Wanqiao Xu, Shi Dong, Dilip Arumugam, Benjamin Van Roy

In this work, we adopt a novel perspective wherein a pre-trained language model is itself simultaneously a policy, reward function, and transition function.

Efficient Exploration • Language Modelling • +2
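
A toy sketch of the perspective described above, in which a single token-level model is read as policy, transition function, and reward at once. The class, the random stand-in for a pre-trained LM, and the log-probability reward below are illustrative assumptions, not the construction from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100  # toy vocabulary size


class LanguageModelAsEnvironment:
    """One token-level model viewed simultaneously as policy, transition, and reward."""

    def __init__(self):
        self._cache = {}

    def token_probs(self, context):
        # Stand-in for a pre-trained LM's next-token distribution (cached so
        # repeated queries of the same context are consistent).
        key = tuple(context)
        if key not in self._cache:
            logits = rng.normal(size=VOCAB)
            p = np.exp(logits - logits.max())
            self._cache[key] = p / p.sum()
        return self._cache[key]

    def policy(self, context):
        # Policy: sample the next token from the model's own distribution.
        return int(rng.choice(VOCAB, p=self.token_probs(context)))

    def transition(self, context, token):
        # Transition: the next state is simply the context extended by the token.
        return context + [token]

    def reward(self, context, token):
        # Placeholder reward: the model's log-probability of the chosen token.
        return float(np.log(self.token_probs(context)[token]))
```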

Inclusive Artificial Intelligence

no code implementations • 24 Dec 2022 • Dilip Arumugam, Shi Dong, Benjamin Van Roy

Prevailing methods for assessing and comparing generative AIs incentivize responses that serve a hypothetical representative individual.

Posterior Sampling for Continuing Environments

no code implementations • 29 Nov 2022 • Wanqiao Xu, Shi Dong, Benjamin Van Roy

We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments.
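
A skeleton of how such an extension might look in code, assuming generic `env`, `posterior`, and `plan` interfaces (all hypothetical) and a simple randomized resampling schedule in place of episode boundaries; the paper's actual resampling rule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)


def psrl_continuing(env, posterior, plan, resample_prob=0.01, steps=100_000):
    """Posterior sampling for RL in a continuing (non-episodic) interface.

    env       -- object with reset() -> state and step(action) -> (state, reward)
    posterior -- object with sample() -> model and update(s, a, r, s_next)
    plan      -- function mapping a sampled model to a policy (state -> action)
    """
    model = posterior.sample()
    policy = plan(model)
    s = env.reset()
    for _ in range(steps):
        a = policy(s)
        s_next, r = env.step(a)
        posterior.update(s, a, r, s_next)
        s = s_next
        # With no episode boundaries, the agent occasionally commits to a
        # freshly sampled model instead of resampling at episode resets.
        if rng.random() < resample_prob:
            model = posterior.sample()
            policy = plan(model)
```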

UIILD: A Unified Interpretable Intelligent Learning Diagnosis Framework for Intelligent Tutoring Systems

no code implementations • 7 Jul 2022 • Zhifeng Wang, Wenxing Yan, Chunyan Zeng, Shi Dong

Intelligent learning diagnosis, which aims to estimate learners' current knowledge mastery and predict their future learning performance, is a critical engine of intelligent tutoring systems.

Representation Learning

Simple Agent, Complex Environment: Efficient Reinforcement Learning with Agent States

no code implementations • 10 Feb 2021 • Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

The time it takes to approach asymptotic performance is polynomial in the complexity of the agent's state representation and the time required to evaluate the best policy that the agent can represent.

Q-Learning • reinforcement-learning • +2

Provably Efficient Reinforcement Learning with Aggregated States

no code implementations • 13 Dec 2019 • Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + \epsilon HK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $\epsilon$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state.

Q-Learning • reinforcement-learning • +1
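
A minimal sketch of the kind of update analyzed above: optimistic Q-learning run on aggregate states rather than raw states. The learning rate and bonus below follow the usual optimistic Q-learning template with illustrative constants; they are not the exact quantities from the paper's regret analysis.

```python
import numpy as np

# Illustrative sizes: H = horizon, M = aggregate states, A = actions, K = episodes.
H, M, A, K = 5, 10, 3, 10_000
c = 1.0  # exploration-bonus scale (a tunable constant)

Q = np.full((H, M, A), float(H))    # optimistic initialization
N = np.zeros((H, M, A), dtype=int)  # visit counts per (step, aggregate state, action)


def q_update(h, s, a, r, s_next, phi):
    """One optimistic Q-learning update, indexed by the aggregate state phi(s)."""
    m = phi(s)
    N[h, m, a] += 1
    n = N[h, m, a]
    alpha = (H + 1) / (H + n)                            # decaying learning rate
    bonus = c * np.sqrt(H ** 3 * np.log(M * A * K) / n)  # optimism bonus
    v_next = 0.0 if h + 1 == H else min(float(H), Q[h + 1, phi(s_next)].max())
    Q[h, m, a] = (1 - alpha) * Q[h, m, a] + alpha * (r + v_next + bonus)
```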

Comments on the Du-Kakade-Wang-Yang Lower Bounds

no code implementations • 18 Nov 2019 • Benjamin Van Roy, Shi Dong

Du, Kakade, Wang, and Yang recently established intriguing lower bounds on sample complexity, which suggest that reinforcement learning with a misspecified representation is intractable.

reinforcement-learning • Reinforcement Learning (RL)

On the Performance of Thompson Sampling on Logistic Bandits

no code implementations • 12 May 2019 • Shi Dong, Tengyu Ma, Benjamin Van Roy

Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$.

Thompson Sampling
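
A small, self-contained sketch of Thompson sampling on a logistic bandit, assuming Bernoulli rewards with a logistic mean and a crude particle approximation of the posterior over the coefficient vector; the action set, prior, and posterior approximation here are illustrative choices rather than the setting of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, T, n_particles = 3, 2000, 500

# True coefficient vector (unknown to the agent).
theta_true = rng.normal(size=d)

# Finite action set of unit feature vectors (a stand-in for the paper's setting,
# where feasible actions coincide with possible coefficient vectors).
actions = rng.normal(size=(50, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)

# Particle approximation to the posterior over theta (prior: standard normal).
particles = rng.normal(size=(n_particles, d))
log_weights = np.zeros(n_particles)

regret = 0.0
best_mean = sigmoid(actions @ theta_true).max()

for t in range(T):
    # Thompson sampling: draw one theta from the (approximate) posterior ...
    probs = np.exp(log_weights - log_weights.max())
    probs /= probs.sum()
    theta_sample = particles[rng.choice(n_particles, p=probs)]
    # ... and play the action that is greedy with respect to that sample.
    a = actions[np.argmax(actions @ theta_sample)]

    # Observe a Bernoulli reward with logistic mean.
    p = sigmoid(a @ theta_true)
    r = rng.random() < p
    regret += best_mean - p

    # Update particle weights with the logistic likelihood of the observation.
    means = sigmoid(particles @ a)
    log_weights += np.log(means if r else 1.0 - means)

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```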

An Information-Theoretic Analysis for Thompson Sampling with Many Actions

no code implementations • NeurIPS 2018 • Shi Dong, Benjamin Van Roy

We also offer a bound for the logistic bandit that dramatically improves on the best previously available, though this bound depends on an information-theoretic statistic that we have only been able to quantify via computation.

Thompson Sampling
