Search Results for author: Benjamin Van Roy

Found 78 papers, 12 papers with code

A Tutorial on Thompson Sampling

2 code implementations • 7 Jul 2017 • Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance.
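
As a minimal illustration of this balance (a sketch only, not code from the tutorial's implementations), here is Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors:

```python
import numpy as np

def thompson_sampling_bernoulli(true_probs, horizon=1000, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    alpha, beta = np.ones(k), np.ones(k)  # Beta posterior parameters
    total_reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)            # sample plausible mean rewards
        a = int(np.argmax(theta))                # exploit the sampled beliefs
        r = float(rng.random() < true_probs[a])  # observe a Bernoulli reward
        alpha[a] += r                            # update the posterior
        beta[a] += 1.0 - r
        total_reward += r
    return total_reward

print(thompson_sampling_bernoulli([0.1, 0.5, 0.7]))
```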

Active Learning Product Recommendation +1

Epistemic Neural Networks

1 code implementation • NeurIPS 2023 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy

We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty.
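
A rough PyTorch sketch of the idea: a small MLP that consumes detached base-network features together with an epistemic index z and produces an additive correction to the base logits. Layer sizes, the index distribution, and the wiring are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class Epinet(nn.Module):
    """Sketch of an epinet: a small additive network conditioned on an
    epistemic index z. Sizes and wiring are illustrative assumptions."""

    def __init__(self, feature_dim, num_classes, index_dim=8, hidden=50):
        super().__init__()
        self.num_classes = num_classes
        self.index_dim = index_dim
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim + index_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes * index_dim),
        )

    def forward(self, features, z):
        # Detach features so the epinet trains with modest incremental compute,
        # without backpropagating into the (possibly large) base model.
        batch = features.shape[0]
        x = torch.cat([features.detach(), z.unsqueeze(0).expand(batch, -1)], dim=-1)
        out = self.mlp(x).view(batch, self.num_classes, self.index_dim)
        return out @ z  # (batch, num_classes) additive correction to base logits

# Varying z (e.g. z = torch.randn(8)) varies the combined prediction
# base_logits + epinet(base_features, z), expressing epistemic uncertainty.
```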

Fine-Tuning Language Models via Epistemic Neural Networks

1 code implementation • 3 Nov 2022 • Ian Osband, Seyed Mohammad Asghari, Benjamin Van Roy, Nat McAleese, John Aslanides, Geoffrey Irving

Language models often pre-train on large unsupervised text corpora, then fine-tune on additional task-specific data.

Active Learning Language Modelling

Approximate Thompson Sampling via Epistemic Neural Networks

1 code implementation • 18 Feb 2023 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy

Further, we demonstrate that the epinet -- a small additive network that estimates uncertainty -- matches the performance of large ensembles at orders of magnitude lower computational cost.

Thompson Sampling

Posterior Sampling for Reinforcement Learning Without Episodes

1 code implementation • 9 Aug 2016 • Ian Osband, Benjamin Van Roy

We review similar results for optimistic algorithms in infinite-horizon problems (Jaksch et al. 2010; Bartlett and Tewari 2009; Abbasi-Yadkori and Szepesvari 2011), with particular attention to dynamic episode growth.

reinforcement-learning Reinforcement Learning (RL)

Generalization and Exploration via Randomized Value Functions

1 code implementation • 4 Feb 2014 • Ian Osband, Benjamin Van Roy, Zheng Wen

We propose randomized least-squares value iteration (RLSVI) -- a new reinforcement learning algorithm designed to explore and generalize efficiently via linearly parameterized value functions.
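
One backward stage of the perturbed regression at the heart of RLSVI can be sketched as follows; the noise scalings and prior-perturbation scheme here are illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np

def rlsvi_stage(phi, targets, sigma=1.0, lam=1.0, rng=None):
    """One backward stage of randomized least-squares value iteration (sketch).

    phi: (n, d) state-action features; targets: rewards plus next-stage values.
    Gaussian perturbations of the targets and the prior turn ridge regression
    into a posterior-style sample of the value-function weights.
    """
    rng = rng or np.random.default_rng()
    n, d = phi.shape
    noisy = targets + rng.normal(0.0, sigma, size=n)   # randomize regression targets
    A = phi.T @ phi / sigma**2 + lam * np.eye(d)
    b = phi.T @ noisy / sigma**2 + np.sqrt(lam) * rng.normal(size=d)  # randomize prior
    return np.linalg.solve(A, b)  # sampled weights drive exploration
```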

Efficient Exploration reinforcement-learning +1

Langevin DQN

2 code implementations • 17 Feb 2020 • Vikranth Dwaracherla, Benjamin Van Roy

Algorithms that tackle deep exploration -- an important challenge in reinforcement learning -- have relied on epistemic uncertainty representation through ensembles or other hypermodels, exploration bonuses, or visitation count distributions.
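
Langevin DQN instead injects noise into the learning updates themselves. A minimal sketch of an SGLD-style step, with the noise scaling as an assumption:

```python
import torch

def langevin_step(params, loss, lr=1e-3, temperature=1.0):
    """One Langevin-style update: a gradient step plus injected Gaussian noise,
    so the iterates wander over plausible value functions instead of collapsing
    to a point estimate. Noise scaling follows the usual SGLD convention."""
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g
            p += torch.randn_like(p) * (2.0 * lr * temperature) ** 0.5
```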

Computational Efficiency Open-Ended Question Answering +2

Deep Exploration via Randomized Value Functions

no code implementations • 22 Mar 2017 • Ian Osband, Benjamin Van Roy, Daniel Russo, Zheng Wen

We study the use of randomized value functions to guide deep exploration in reinforcement learning.

Efficient Exploration reinforcement-learning +1

An Information-Theoretic Analysis for Thompson Sampling with Many Actions

no code implementations • NeurIPS 2018 • Shi Dong, Benjamin Van Roy

We also offer a bound for the logistic bandit that dramatically improves on the best previously available, though this bound depends on an information-theoretic statistic that we have only been able to quantify via computation.

Thompson Sampling

Scalable Coordinated Exploration in Concurrent Reinforcement Learning

1 code implementation • NeurIPS 2018 • Maria Dimakopoulou, Ian Osband, Benjamin Van Roy

We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale.

reinforcement-learning Reinforcement Learning (RL)

Satisficing in Time-Sensitive Bandit Learning

no code implementations • 7 Mar 2018 • Daniel Russo, Benjamin Van Roy

Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action.

Thompson Sampling

Gaussian-Dirichlet Posterior Dominance in Sequential Learning

no code implementations • 14 Feb 2017 • Ian Osband, Benjamin Van Roy

We consider the problem of sequential learning from categorical observations bounded in [0, 1].

Ensemble Sampling

no code implementations • NeurIPS 2017 • Xiuyuan Lu, Benjamin Van Roy

Thompson sampling has emerged as an effective heuristic for a broad range of online decision problems.
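
A minimal sketch for an independent-arm Gaussian bandit, where M perturbed models stand in for posterior samples; the prior and perturbation scales are illustrative assumptions:

```python
import numpy as np

class EnsembleSampler:
    """Ensemble sampling sketch: M perturbed models stand in for a posterior."""

    def __init__(self, num_models, num_actions, prior_scale=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        # Each model starts from an independent draw from the prior.
        self.means = self.rng.normal(0.0, prior_scale, size=(num_models, num_actions))
        self.counts = np.ones((num_models, num_actions))

    def act(self):
        m = self.rng.integers(len(self.means))   # sample a model uniformly...
        return int(np.argmax(self.means[m]))     # ...and act greedily under it

    def update(self, action, reward, noise_scale=1.0):
        # Each model fits its own noise-perturbed copy of the observation,
        # which keeps the ensemble spread out like posterior samples.
        for m in range(len(self.means)):
            perturbed = reward + self.rng.normal(0.0, noise_scale)
            self.counts[m, action] += 1
            self.means[m, action] += (perturbed - self.means[m, action]) / self.counts[m, action]
```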

Thompson Sampling

Learning to Price with Reference Effects

no code implementations • 29 Aug 2017 • Abbas Kazerouni, Benjamin Van Roy

As a firm varies the price of a product, consumers exhibit reference effects, making purchase decisions based not only on the prevailing price but also the product's price history.

Thompson Sampling

Learning to Optimize via Information-Directed Sampling

no code implementations • NeurIPS 2014 • Daniel Russo, Benjamin Van Roy

We propose information-directed sampling -- a new approach to online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback.
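
The paper minimizes the information ratio over randomized actions; a deterministic simplification, given per-action estimates of expected regret and expected information gain, looks like this:

```python
import numpy as np

def ids_action(expected_regret, information_gain, eps=1e-12):
    """Pick the action minimizing the information ratio:
    squared expected regret divided by expected information gain."""
    ratio = expected_regret**2 / np.maximum(information_gain, eps)
    return int(np.argmin(ratio))
```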

On Optimistic versus Randomized Exploration in Reinforcement Learning

no code implementations • 13 Jun 2017 • Ian Osband, Benjamin Van Roy

We discuss the relative merits of optimistic and randomized approaches to exploration in reinforcement learning.

Computational Efficiency reinforcement-learning +1

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

no code implementations • ICML 2017 • Ian Osband, Benjamin Van Roy

Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms algorithms driven by optimism, such as UCRL2.
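
PSRL's loop is simple to sketch for a tabular MDP: sample a model from the posterior, solve it, and act greedily for an episode. Here env_step is an assumed interface, and rewards use empirical means for brevity where the full algorithm samples them as well:

```python
import numpy as np

def psrl_episode(counts, reward_sums, horizon, env_step, rng):
    """One episode of posterior sampling for RL (sketch).

    counts[s, a, s2]: Dirichlet transition counts; reward_sums[s, a]: summed
    rewards. env_step(s, a) -> (next_state, reward) is an assumed interface.
    """
    num_states, num_actions, _ = counts.shape
    # Sample a transition model from the Dirichlet posterior.
    P = np.apply_along_axis(rng.dirichlet, 2, counts + 1.0)
    R = reward_sums / np.maximum(counts.sum(axis=2), 1.0)
    # Solve the sampled MDP by backward induction.
    Q = np.zeros((horizon + 1, num_states, num_actions))
    for h in range(horizon - 1, -1, -1):
        Q[h] = R + P @ Q[h + 1].max(axis=1)
    # Act greedily under the sampled MDP; update the posterior as we go.
    s = 0
    for h in range(horizon):
        a = int(np.argmax(Q[h, s]))
        s2, r = env_step(s, a)
        counts[s, a, s2] += 1
        reward_sums[s, a] += r
        s = s2
```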

reinforcement-learning Reinforcement Learning (RL)

Time-Sensitive Bandit Learning and Satisficing Thompson Sampling

no code implementations • 28 Apr 2017 • Daniel Russo, David Tse, Benjamin Van Roy

We propose satisficing Thompson sampling -- a variation of Thompson sampling -- and establish a strong discounted regret bound for this new algorithm.

Thompson Sampling

Conservative Contextual Linear Bandits

no code implementations • NeurIPS 2017 • Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi-Yadkori, Benjamin Van Roy

We prove an upper bound on the regret of CLUCB and show that it decomposes into two terms: 1) an upper bound on the regret of the standard linear UCB algorithm, which grows with the time horizon, and 2) a constant term, independent of the time horizon, that accounts for the loss of being conservative in order to satisfy the safety constraint.

Decision Making Marketing

On Lower Bounds for Regret in Reinforcement Learning

no code implementations • 9 Aug 2016 • Ian Osband, Benjamin Van Roy

This is a brief technical note to clarify the state of lower bounds on regret for reinforcement learning.

reinforcement-learning Reinforcement Learning (RL)

Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

no code implementations • 18 Jul 2013 • Zheng Wen, Benjamin Van Roy

We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization.

Efficient Exploration reinforcement-learning +1

Bootstrapped Thompson Sampling and Deep Exploration

no code implementations • 1 Jul 2015 • Ian Osband, Benjamin Van Roy

This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions.

reinforcement-learning Reinforcement Learning (RL) +1

An Information-Theoretic Analysis of Thompson Sampling

no code implementations • 21 Mar 2014 • Daniel Russo, Benjamin Van Roy

We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback.

Thompson Sampling

Near-optimal Reinforcement Learning in Factored MDPs

no code implementations • NeurIPS 2014 • Ian Osband, Benjamin Van Roy

Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces.

reinforcement-learning Reinforcement Learning (RL)

Learning to Optimize Via Posterior Sampling

no code implementations • 11 Jan 2013 • Daniel Russo, Benjamin Van Roy

This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions such as in multi-armed bandit problems.

Thompson Sampling

(More) Efficient Reinforcement Learning via Posterior Sampling

no code implementations • NeurIPS 2013 • Ian Osband, Daniel Russo, Benjamin Van Roy

This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

Efficient Exploration reinforcement-learning +1

Eluder Dimension and the Sample Complexity of Optimistic Exploration

no code implementations • NeurIPS 2013 • Daniel Russo, Benjamin Van Roy

This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms.

Thompson Sampling

Efficient Exploration and Value Function Generalization in Deterministic Systems

no code implementations • NeurIPS 2013 • Zheng Wen, Benjamin Van Roy

We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization.

Efficient Exploration reinforcement-learning +1

On the Performance of Thompson Sampling on Logistic Bandits

no code implementations • 12 May 2019 • Shi Dong, Tengyu Ma, Benjamin Van Roy

Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$.

Thompson Sampling

Comments on the Du-Kakade-Wang-Yang Lower Bounds

no code implementations • 18 Nov 2019 • Benjamin Van Roy, Shi Dong

Du, Kakade, Wang, and Yang recently established intriguing lower bounds on sample complexity, which suggest that reinforcement learning with a misspecified representation is intractable.

reinforcement-learning Reinforcement Learning (RL)

Provably Efficient Reinforcement Learning with Aggregated States

no code implementations • 13 Dec 2019 • Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + \epsilon HK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $\epsilon$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state.
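
A sketch of the flavor of algorithm analyzed: optimistic Q-learning over aggregate states, with the step-size schedule and bonus constants as illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np

def optimistic_q_update(Q, n, m, a, r, v_next, H, c=1.0):
    """Optimistic Q-learning over aggregate states (sketch).

    m is the aggregate-state index; Q and n are tables over (aggregate, action).
    """
    n[m, a] += 1
    alpha = (H + 1) / (H + n[m, a])         # step size decaying with visits
    bonus = c * np.sqrt(H**3 / n[m, a])     # optimism bonus shrinking with visits
    Q[m, a] += alpha * (r + v_next + bonus - Q[m, a])
```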

Q-Learning reinforcement-learning +1

Adaptive Execution: Exploration and Learning of Price Impact

no code implementations • 26 Jul 2012 • Beomsoo Park, Benjamin Van Roy

The trader must learn coefficients of a price impact model while trading.

Trading and Market Microstructure

Randomized Value Functions via Posterior State-Abstraction Sampling

no code implementations • 5 Oct 2020 • Dilip Arumugam, Benjamin Van Roy

State abstraction has been an essential tool for dramatically improving the sample efficiency of reinforcement-learning algorithms.

On Efficiency in Hierarchical Reinforcement Learning

no code implementations • NeurIPS 2020 • Zheng Wen, Doina Precup, Morteza Ibrahimi, Andre Barreto, Benjamin Van Roy, Satinder Singh

Hierarchical Reinforcement Learning (HRL) approaches promise to provide more efficient solutions to sequential decision making problems, both in terms of statistical as well as computational efficiency.

Computational Efficiency Decision Making +4

Deciding What to Learn: A Rate-Distortion Approach

no code implementations • 15 Jan 2021 • Dilip Arumugam, Benjamin Van Roy

Agents that learn to select optimal actions represent a prominent focus of the sequential decision-making literature.

Decision Making Thompson Sampling

Simple Agent, Complex Environment: Efficient Reinforcement Learning with Agent States

no code implementations • 10 Feb 2021 • Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

The time it takes to approach asymptotic performance is polynomial in the complexity of the agent's state representation and the time required to evaluate the best policy that the agent can represent.

Q-Learning reinforcement-learning +2

A Bit Better? Quantifying Information for Bandit Learning

no code implementations • 18 Feb 2021 • Adithya M. Devraj, Benjamin Van Roy, Kuang Xu

The information ratio offers an approach to assessing the efficacy with which an agent balances between exploration and exploitation.

Reinforcement Learning, Bit by Bit

no code implementations • 6 Mar 2021 • Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen

To illustrate concepts, we design simple agents that build on them and present computational results that highlight data efficiency.

reinforcement-learning Reinforcement Learning (RL)

Deep Exploration for Recommendation Systems

no code implementations • 26 Sep 2021 • Zheqing Zhu, Benjamin Van Roy

Where past work has aimed to learn from subsequent behavior, there has been a lack of effective methods for probing to elicit informative delayed feedback.

Recommendation Systems Thompson Sampling

The Value of Information When Deciding What to Learn

no code implementations • NeurIPS 2021 • Dilip Arumugam, Benjamin Van Roy

All sequential decision-making agents explore so as to acquire knowledge about a particular target.

Decision Making

Gaussian Imagination in Bandit Learning

no code implementations • 6 Jan 2022 • Yueyang Liu, Adithya M. Devraj, Benjamin Van Roy, Kuang Xu

We study the performance of an agent that attains a bounded information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function when applied instead to a Bernoulli bandit.

An Information-Theoretic Framework for Supervised Learning

no code implementations • 1 Mar 2022 • Hong Jun Jeon, Yifan Zhu, Benjamin Van Roy

For a particular prior distribution on weights, we establish sample complexity bounds that are simultaneously width independent and linear in depth.

An Analysis of Ensemble Sampling

no code implementations • 2 Mar 2022 • Chao Qin, Zheng Wen, Xiuyuan Lu, Benjamin Van Roy

Ensemble sampling serves as a practical approximation to Thompson sampling when maintaining an exact posterior distribution over model parameters is computationally intractable.

Thompson Sampling

Non-Stationary Bandit Learning via Predictive Sampling

no code implementations • 4 May 2022 • Yueyang Liu, Xu Kuang, Benjamin Van Roy

We attribute such failures to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity.

Attribute Thompson Sampling

Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning

no code implementations • 4 Jun 2022 • Dilip Arumugam, Benjamin Van Roy

To address this problem, we introduce an algorithm that, using rate-distortion theory, iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model.

Decision Making Model-based Reinforcement Learning +2

Between Rate-Distortion Theory & Value Equivalence in Model-Based Reinforcement Learning

no code implementations • 4 Jun 2022 • Dilip Arumugam, Benjamin Van Roy

The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment.

Decision Making Model-based Reinforcement Learning +2

Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping

no code implementations • 8 Jun 2022 • Vikranth Dwaracherla, Zheng Wen, Ian Osband, Xiuyuan Lu, Seyed Mohammad Asghari, Benjamin Van Roy

In machine learning, an agent needs to estimate uncertainty to efficiently explore and adapt and to make effective decisions.

Robustness of Epinets against Distributional Shifts

no code implementations • 1 Jul 2022 • Xiuyuan Lu, Ian Osband, Seyed Mohammad Asghari, Sven Gowal, Vikranth Dwaracherla, Zheng Wen, Benjamin Van Roy

However, these improvements are relatively small compared to the outstanding issues in distributionally-robust deep learning.

Is Stochastic Gradient Descent Near Optimal?

no code implementations • 18 Sep 2022 • Yifan Zhu, Hong Jun Jeon, Benjamin Van Roy

However, existing computational theory suggests that, even for single-hidden-layer teacher networks, to attain small error for all such teacher networks, the computation required to achieve this sample complexity is intractable.

On Rate-Distortion Theory in Capacity-Limited Cognition & Reinforcement Learning

no code implementations • 30 Oct 2022 • Dilip Arumugam, Mark K. Ho, Noah D. Goodman, Benjamin Van Roy

Throughout the cognitive-science literature, there is widespread agreement that decision-making agents operating in the real world do so under limited information-processing capabilities and without access to unbounded cognitive or computational resources.

Decision Making reinforcement-learning +1

Posterior Sampling for Continuing Environments

no code implementations • 29 Nov 2022 • Wanqiao Xu, Shi Dong, Benjamin Van Roy

We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments.

An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws

no code implementations • 2 Dec 2022 • Hong Jun Jeon, Benjamin Van Roy

For a particular learning model inspired by Barron (1993), we establish an upper bound on the minimal information-theoretically achievable expected error as a function of model and data set sizes.

Language Modelling

Inclusive Artificial Intelligence

no code implementations • 24 Dec 2022 • Dilip Arumugam, Shi Dong, Benjamin Van Roy

Prevailing methods for assessing and comparing generative AIs incentivize responses that serve a hypothetical representative individual.

Leveraging Demonstrations to Improve Online Learning: Quality Matters

no code implementations • 7 Feb 2023 • Botao Hao, Rahul Jain, Tor Lattimore, Benjamin Van Roy, Zheng Wen

This offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level.

Thompson Sampling

A Definition of Non-Stationary Bandits

no code implementations • 23 Feb 2023 • Yueyang Liu, Xu Kuang, Benjamin Van Roy

Despite the subject of non-stationary bandit learning having attracted much recent attention, we have yet to identify a formal definition of non-stationarity that can consistently distinguish non-stationary bandits from stationary ones.

Bayesian Reinforcement Learning with Limited Cognitive Load

no code implementations • 5 May 2023 • Dilip Arumugam, Mark K. Ho, Noah D. Goodman, Benjamin Van Roy

All biological and artificial agents must learn and make decisions given limits on their ability to process information.

Decision Making reinforcement-learning

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

no code implementations • 19 May 2023 • Wanqiao Xu, Shi Dong, Dilip Arumugam, Benjamin Van Roy

In this work, we adopt a novel perspective wherein a pre-trained language model is itself simultaneously a policy, reward function, and transition function.

Efficient Exploration Language Modelling +2

Scalable Neural Contextual Bandit for Recommender Systems

no code implementations • 26 Jun 2023 • Zheqing Zhu, Benjamin Van Roy

In two distinct large-scale experiments with real-world tasks, ENR significantly boosts click-through rates and user ratings by at least 9% and 6% respectively compared to state-of-the-art neural contextual bandit algorithms.

Recommendation Systems Thompson Sampling

Continual Learning as Computationally Constrained Reinforcement Learning

no code implementations • 10 Jul 2023 • Saurabh Kumar, Henrik Marklund, Ashish Rao, Yifan Zhu, Hong Jun Jeon, Yueyang Liu, Benjamin Van Roy

The design of such agents, which remains a long-standing challenge of artificial intelligence, is addressed by the subject of continual learning.

Continual Learning reinforcement-learning

A Definition of Continual Reinforcement Learning

no code implementations • NeurIPS 2023 • David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, Satinder Singh

Using this new language, we define a continual learning agent as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents.

Continual Learning reinforcement-learning

On the Convergence of Bounded Agents

no code implementations • 20 Jul 2023 • David Abel, André Barreto, Hado van Hasselt, Benjamin Van Roy, Doina Precup, Satinder Singh

Standard models of the reinforcement learning problem give rise to a straightforward definition of convergence: An agent converges when its behavior or performance in each environment state stops changing.

reinforcement-learning

Maintaining Plasticity in Continual Learning via Regenerative Regularization

no code implementations • 23 Aug 2023 • Saurabh Kumar, Henrik Marklund, Benjamin Van Roy

In this paper, we propose L2 Init, a simple approach for maintaining plasticity by incorporating in the loss function L2 regularization toward initial parameters.
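
The mechanism is a one-line change to the training loss, regularizing toward the initial parameters rather than toward zero. A minimal sketch:

```python
import torch

def l2_init_loss(task_loss, params, init_params, reg_strength=1e-3):
    """L2 Init sketch: penalize drift from the *initial* parameters, not from
    zero, so the network retains the plasticity of its initialization."""
    drift = sum(((p - p0) ** 2).sum() for p, p0 in zip(params, init_params))
    return task_loss + reg_strength * drift
```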

Continual Learning L2 Regularization

Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble Sampling

no code implementations • 11 Oct 2023 • Zheqing Zhu, Yueyang Liu, Xu Kuang, Benjamin Van Roy

Real-world applications of contextual bandits often exhibit non-stationarity due to seasonality, serendipity, and evolving social trends.

Multi-Armed Bandits

RLHF and IIA: Perverse Incentives

no code implementations • 2 Dec 2023 • Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy

Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA).

reinforcement-learning

Adaptive Crowdsourcing Via Self-Supervised Learning

no code implementations • 24 Jan 2024 • Anmol Kagrecha, Henrik Marklund, Benjamin Van Roy, Hong Jun Jeon, Richard Zeckhauser

Common crowdsourcing systems average estimates of a latent quantity of interest provided by many crowdworkers to produce a group estimate.

Self-Supervised Learning

An Information-Theoretic Analysis of In-Context Learning

no code implementations • 28 Jan 2024 • Hong Jun Jeon, Jason D. Lee, Qi Lei, Benjamin Van Roy

Previous theoretical results pertaining to meta-learning on sequences build on contrived assumptions and are somewhat convoluted.

In-Context Learning Meta-Learning

Efficient Exploration for LLMs

no code implementations • 1 Feb 2024 • Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy

We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models.

Efficient Exploration Thompson Sampling
