Search Results for author: Benjamin Van Roy

Found 46 papers, 8 papers with code

Evaluating Predictive Distributions: Does Bayesian Deep Learning Work?

1 code implementation 9 Oct 2021 Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Botao Hao, Morteza Ibrahimi, Dieterich Lawson, Xiuyuan Lu, Brendan O'Donoghue, Benjamin Van Roy

This paper introduces The Neural Testbed, which provides tools for the systematic evaluation of agents that generate predictive distributions.

Deep Exploration for Recommendation Systems

no code implementations 26 Sep 2021 Zheqing Zhu, Benjamin Van Roy

We investigate the design of recommendation systems that can efficiently learn from sparse and delayed feedback.

Recommendation Systems

Evaluating Probabilistic Inference in Deep Learning: Beyond Marginal Predictions

no code implementations 20 Jul 2021 Xiuyuan Lu, Ian Osband, Benjamin Van Roy, Zheng Wen

A fundamental challenge for any intelligent system is prediction: given some inputs $X_1, \ldots, X_\tau$, can you predict outcomes $Y_1, \ldots, Y_\tau$?

Epistemic Neural Networks

1 code implementation 19 Jul 2021 Ian Osband, Zheng Wen, Mohammad Asghari, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy

All existing approaches to uncertainty modeling can be expressed as ENNs, and any ENN can be identified with a Bayesian neural network.
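
For orientation, the epistemic neural network (ENN) interface pairs a conventional input x with an epistemic index z drawn from a reference distribution, so that variation over z expresses what the model does not know. The following is a minimal illustrative sketch in Python; the class, names, and parameter values are ours, not the paper's.

import numpy as np

class ToyENN:
    """Illustrative epistemic neural network: a function f(x, z) of an input x
    and an epistemic index z drawn from a reference distribution."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.base = rng.normal(size=dim)    # stands in for learned weights
        self.scale = 0.1 * np.ones(dim)     # stands in for learned uncertainty

    def sample_index(self, rng):
        # Reference distribution for the epistemic index z.
        return rng.normal(size=self.base.shape)

    def predict(self, x, z):
        # Different indices z give different plausible functions of x;
        # averaging over many z yields a marginal prediction.
        return float((self.base + self.scale * z) @ x)

rng = np.random.default_rng(1)
enn = ToyENN(dim=3)
x = np.array([1.0, -0.5, 2.0])
predictions = [enn.predict(x, enn.sample_index(rng)) for _ in range(5)]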

Reinforcement Learning, Bit by Bit

no code implementations 6 Mar 2021 Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen

Reinforcement learning agents have demonstrated remarkable achievements in simulated environments.

A Bit Better? Quantifying Information for Bandit Learning

no code implementations 18 Feb 2021 Adithya M. Devraj, Benjamin Van Roy, Kuang Xu

The information ratio offers an approach to assessing the efficacy with which an agent balances between exploration and exploitation.
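
For context, the information ratio introduced in Russo and Van Roy's earlier analysis relates squared expected instantaneous regret to the information gained about the optimal action; the standard form is

$$\Gamma_t = \frac{\big(\mathbb{E}_t[\Delta_t]\big)^2}{I_t\big(A^*;\,(A_t, Y_t)\big)},$$

where $\Delta_t$ is the regret incurred at period $t$, $A^*$ is the optimal action, and $I_t$ denotes mutual information under the posterior at period $t$. The exact variant quantified in this paper may differ.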

Simple Agent, Complex Environment: Efficient Reinforcement Learning with Agent States

no code implementations 10 Feb 2021 Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

The time it takes to approach asymptotic performance is polynomial in the complexity of the agent's state representation and the time required to evaluate the best policy that the agent can represent.

Q-Learning, Representation Learning

Deciding What to Learn: A Rate-Distortion Approach

no code implementations 15 Jan 2021 Dilip Arumugam, Benjamin Van Roy

Agents that learn to select optimal actions represent a prominent focus of the sequential decision-making literature.

Decision Making

On Efficiency in Hierarchical Reinforcement Learning

no code implementations NeurIPS 2020 Zheng Wen, Doina Precup, Morteza Ibrahimi, Andre Barreto, Benjamin Van Roy, Satinder Singh

Hierarchical Reinforcement Learning (HRL) approaches promise to provide more efficient solutions to sequential decision making problems, both in terms of statistical as well as computational efficiency.

Decision Making, Hierarchical Reinforcement Learning

Randomized Value Functions via Posterior State-Abstraction Sampling

no code implementations 5 Oct 2020 Dilip Arumugam, Benjamin Van Roy

State abstraction has been an essential tool for dramatically improving the sample efficiency of reinforcement-learning algorithms.

Langevin DQN

2 code implementations 17 Feb 2020 Vikranth Dwaracherla, Benjamin Van Roy

Algorithms that tackle deep exploration -- an important challenge in reinforcement learning -- have relied on epistemic uncertainty representation through ensembles or other hypermodels, exploration bonuses, or visitation count distributions.
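
As the title suggests, the proposed alternative injects Gaussian noise into gradient-based value learning in the spirit of stochastic gradient Langevin dynamics. As a hedged reminder of what a generic SGLD-style update looks like (not necessarily the exact update used in the paper), with loss $L$, step size $\alpha$, and inverse temperature $\beta$:

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t) + \sqrt{2\alpha/\beta}\,\varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, I).$$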

Provably Efficient Reinforcement Learning with Aggregated States

no code implementations 13 Dec 2019 Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + \epsilon HK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $\epsilon$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state.

Q-Learning

Information-Theoretic Confidence Bounds for Reinforcement Learning

no code implementations NeurIPS 2019 Xiuyuan Lu, Benjamin Van Roy

We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling.

Comments on the Du-Kakade-Wang-Yang Lower Bounds

no code implementations 18 Nov 2019 Benjamin Van Roy, Shi Dong

Du, Kakade, Wang, and Yang recently established intriguing lower bounds on sample complexity, which suggest that reinforcement learning with a misspecified representation is intractable.

Behaviour Suite for Reinforcement Learning

2 code implementations ICLR 2020 Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, Hado van Hasselt

bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives.

On the Performance of Thompson Sampling on Logistic Bandits

no code implementations 12 May 2019 Shi Dong, Tengyu Ma, Benjamin Van Roy

Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$.

An Information-Theoretic Analysis for Thompson Sampling with Many Actions

no code implementations NeurIPS 2018 Shi Dong, Benjamin Van Roy

We also offer a bound for the logistic bandit that dramatically improves on the best previously available, though this bound depends on an information-theoretic statistic that we have only been able to quantify via computation.

Scalable Coordinated Exploration in Concurrent Reinforcement Learning

1 code implementation NeurIPS 2018 Maria Dimakopoulou, Ian Osband, Benjamin Van Roy

We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale.

Satisficing in Time-Sensitive Bandit Learning

no code implementations 7 Mar 2018 Daniel Russo, Benjamin Van Roy

Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action.

Coordinated Exploration in Concurrent Reinforcement Learning

no code implementations ICML 2018 Maria Dimakopoulou, Benjamin Van Roy

We consider a team of reinforcement learning agents that concurrently learn to operate in a common environment.

Learning to Price with Reference Effects

no code implementations 29 Aug 2017 Abbas Kazerouni, Benjamin Van Roy

As a firm varies the price of a product, consumers exhibit reference effects, making purchase decisions based not only on the prevailing price but also the product's price history.

A Tutorial on Thompson Sampling

3 code implementations 7 Jul 2017 Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance.

Active Learning, Product Recommendation
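
To make the exploit-versus-learn balance concrete, here is a minimal Beta-Bernoulli Thompson sampling loop in Python. It is a standard textbook sketch under those assumptions, not code from the tutorial itself.

import numpy as np

def thompson_bernoulli(true_means, horizon, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.ones(k)   # Beta posterior alpha parameters
    failures = np.ones(k)    # Beta posterior beta parameters
    total_reward = 0.0
    for _ in range(horizon):
        # Sample a mean estimate for each arm from its posterior,
        # then play the arm whose sample is largest.
        theta = rng.beta(successes, failures)
        arm = int(np.argmax(theta))
        reward = float(rng.random() < true_means[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

print(thompson_bernoulli([0.2, 0.5, 0.7], horizon=1000))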

On Optimistic versus Randomized Exploration in Reinforcement Learning

no code implementations 13 Jun 2017 Ian Osband, Benjamin Van Roy

We discuss the relative merits of optimistic and randomized approaches to exploration in reinforcement learning.

Ensemble Sampling

no code implementations NeurIPS 2017 Xiuyuan Lu, Benjamin Van Roy

Thompson sampling has emerged as an effective heuristic for a broad range of online decision problems.
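
Ensemble sampling approximates the posterior-sampling step by maintaining a small ensemble of models and acting on one member drawn uniformly at random each period. The Python sketch below illustrates the idea for a Bernoulli bandit; the perturbation scheme is a simplification of the paper's, and the names are ours.

import numpy as np

def ensemble_sampling_bernoulli(true_means, horizon, num_models=10, seed=0):
    """Simplified ensemble sampling: each member keeps its own perturbed
    reward estimates; each period we act greedily under a random member."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros((num_models, k))
    estimates = rng.uniform(size=(num_models, k))  # diverse random initialization
    total_reward = 0.0
    for _ in range(horizon):
        m = rng.integers(num_models)               # pick an ensemble member
        arm = int(np.argmax(estimates[m]))         # act greedily under it
        reward = float(rng.random() < true_means[arm])
        for j in range(num_models):
            # Every member sees the observation, perturbed by its own noise,
            # so the ensemble retains diversity.
            noisy = reward + rng.normal(scale=0.5)
            counts[j, arm] += 1
            estimates[j, arm] += (noisy - estimates[j, arm]) / counts[j, arm]
        total_reward += reward
    return total_reward

print(ensemble_sampling_bernoulli([0.2, 0.5, 0.7], horizon=1000))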

Time-Sensitive Bandit Learning and Satisficing Thompson Sampling

no code implementations 28 Apr 2017 Daniel Russo, David Tse, Benjamin Van Roy

We propose satisficing Thompson sampling -- a variation of Thompson sampling -- and establish a strong discounted regret bound for this new algorithm.

Deep Exploration via Randomized Value Functions

no code implementations 22 Mar 2017 Ian Osband, Benjamin Van Roy, Daniel Russo, Zheng Wen

We study the use of randomized value functions to guide deep exploration in reinforcement learning.

Efficient Exploration

Gaussian-Dirichlet Posterior Dominance in Sequential Learning

no code implementations 14 Feb 2017 Ian Osband, Benjamin Van Roy

We consider the problem of sequential learning from categorical observations bounded in [0, 1].

Conservative Contextual Linear Bandits

no code implementations NeurIPS 2017 Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi-Yadkori, Benjamin Van Roy

We prove an upper bound on the regret of CLUCB and show that it can be decomposed into two terms: 1) an upper bound on the regret of the standard linear UCB algorithm, which grows with the time horizon, and 2) a constant term, independent of the time horizon, that accounts for the loss incurred by being conservative in order to satisfy the safety constraint.

Decision Making

On Lower Bounds for Regret in Reinforcement Learning

no code implementations 9 Aug 2016 Ian Osband, Benjamin Van Roy

This is a brief technical note to clarify the state of lower bounds on regret for reinforcement learning.

Posterior Sampling for Reinforcement Learning Without Episodes

no code implementations 9 Aug 2016 Ian Osband, Benjamin Van Roy

We review similar results for optimistic algorithms in infinite-horizon problems (Jaksch et al. 2010, Bartlett and Tewari 2009, Abbasi-Yadkori and Szepesvari 2011), with particular attention to dynamic episode growth.

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

no code implementations ICML 2017 Ian Osband, Benjamin Van Roy

Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms algorithms driven by optimism, such as UCRL2.

Bootstrapped Thompson Sampling and Deep Exploration

no code implementations 1 Jul 2015 Ian Osband, Benjamin Van Roy

This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions.
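
The rough idea is to substitute the bootstrap for the posterior: resample the observed history with replacement, fit a point estimate to the resample, and act greedily under it. Below is a hedged Python sketch for a Bernoulli bandit; it is our simplification, not the note's exact algorithm.

import numpy as np

def bootstrapped_ts_bernoulli(true_means, horizon, seed=0):
    """Bootstrap stand-in for posterior sampling in a Bernoulli bandit."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    # Two pseudo-observations (one success, one failure) per arm so the
    # bootstrap has something to resample before any real data arrives.
    history = [[1.0, 0.0] for _ in range(k)]
    total_reward = 0.0
    for _ in range(horizon):
        estimates = []
        for rewards in history:
            # Resample each arm's history with replacement and use the
            # resample's mean in place of a posterior sample.
            boot = rng.choice(rewards, size=len(rewards), replace=True)
            estimates.append(boot.mean())
        arm = int(np.argmax(estimates))
        reward = float(rng.random() < true_means[arm])
        history[arm].append(reward)
        total_reward += reward
    return total_reward

print(bootstrapped_ts_bernoulli([0.2, 0.5, 0.7], horizon=1000))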

Learning to Optimize via Information-Directed Sampling

no code implementations NeurIPS 2014 Daniel Russo, Benjamin Van Roy

We propose information-directed sampling -- a new approach to online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback.
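
Concretely, information-directed sampling selects, each period, an action distribution $\pi_t$ that minimizes the ratio of squared expected regret to expected information gain about the optimal action (notation here follows the standard presentation and may differ in detail from the paper):

$$\pi_t \in \operatorname*{arg\,min}_{\pi} \frac{\Delta_t(\pi)^2}{g_t(\pi)}, \qquad \Delta_t(\pi) = \sum_{a} \pi(a)\,\mathbb{E}_t[\Delta_t(a)], \qquad g_t(\pi) = \sum_{a} \pi(a)\, I_t\big(A^*; Y_{t,a}\big).$$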

An Information-Theoretic Analysis of Thompson Sampling

no code implementations 21 Mar 2014 Daniel Russo, Benjamin Van Roy

We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback.

Near-optimal Reinforcement Learning in Factored MDPs

no code implementations NeurIPS 2014 Ian Osband, Benjamin Van Roy

Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces.

Generalization and Exploration via Randomized Value Functions

1 code implementation 4 Feb 2014 Ian Osband, Benjamin Van Roy, Zheng Wen

We propose randomized least-squares value iteration (RLSVI) -- a new reinforcement learning algorithm designed to explore and generalize efficiently via linearly parameterized value functions.

Efficient Exploration
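
RLSVI, roughly speaking, replaces the point estimate of least-squares value iteration with a random draw: value weights are sampled from a Gaussian centred at a regularized least-squares solution, and the agent acts greedily under the sampled value function. The Python sketch below shows that sampling step for a single linear regression; it is our simplification under those assumptions, not the paper's full algorithm.

import numpy as np

def sample_value_weights(features, targets, noise_var=1.0, prior_var=10.0, seed=0):
    """Draw value-function weights from the Gaussian used by randomized
    least squares: mean = ridge solution, covariance shrinks with data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)       # shape (n, d): state-action features
    y = np.asarray(targets, dtype=float)        # shape (n,): regression targets
    d = X.shape[1]
    precision = X.T @ X / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / noise_var
    return rng.multivariate_normal(mean, cov)   # one randomized value function

weights = sample_value_weights(
    features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    targets=[0.5, 1.0, 1.2],
)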

Eluder Dimension and the Sample Complexity of Optimistic Exploration

no code implementations NeurIPS 2013 Daniel Russo, Benjamin Van Roy

This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms.

Efficient Exploration and Value Function Generalization in Deterministic Systems

no code implementations NeurIPS 2013 Zheng Wen, Benjamin Van Roy

We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization.

Efficient Exploration

Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

no code implementations 18 Jul 2013 Zheng Wen, Benjamin Van Roy

We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization.

Efficient Exploration

(More) Efficient Reinforcement Learning via Posterior Sampling

no code implementations NeurIPS 2013 Ian Osband, Daniel Russo, Benjamin Van Roy

This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

Efficient Exploration

Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

no code implementations NeurIPS 2012 Morteza Ibrahimi, Adel Javanmard, Benjamin Van Roy

In particular, our algorithm has an average cost of $(1+\epsilon)$ times the optimum cost after $T = \mathrm{polylog}(p)\, O(1/\epsilon^2)$.

Learning to Optimize Via Posterior Sampling

no code implementations 11 Jan 2013 Daniel Russo, Benjamin Van Roy

This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions such as in multi-armed bandit problems.

Adaptive Execution: Exploration and Learning of Price Impact

no code implementations 26 Jul 2012 Beomsoo Park, Benjamin Van Roy

The trader must learn coefficients of a price impact model while trading.

Trading and Market Microstructure
