no code implementations • 6 Aug 2024 • Saurabh Kumar, Hong Jun Jeon, Alex Lewandowski, Benjamin Van Roy
The "small agent, big world" frame offers a conceptual view that motivates the need for continual learning.
no code implementations • 17 Jul 2024 • Hong Jun Jeon, Benjamin Van Roy
Concretely, we provide a theoretical framework rooted in Bayesian statistics and Shannon's information theory which is general enough to unify the analysis of many phenomena in machine learning.
no code implementations • 16 Jul 2024 • Dilip Arumugam, Wanqiao Xu, Benjamin Van Roy
A sequential decision-making agent balances between exploring to gain new knowledge about an environment and exploiting current knowledge to maximize immediate reward.
no code implementations • 16 Jul 2024 • Dilip Arumugam, Saurabh Kumar, Ramki Gummadi, Benjamin Van Roy
In this work, we remedy this issue by extending an agent that directly represents uncertainty over the optimal value function, allowing it both to bypass the need for model-based planning and to learn satisficing policies.
no code implementations • 1 Feb 2024 • Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy
We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models.
no code implementations • 28 Jan 2024 • Hong Jun Jeon, Jason D. Lee, Qi Lei, Benjamin Van Roy
Previous theoretical results pertaining to meta-learning on sequences build on contrived assumptions and are somewhat convoluted.
no code implementations • 24 Jan 2024 • Anmol Kagrecha, Henrik Marklund, Benjamin Van Roy, Hong Jun Jeon, Richard Zeckhauser
Common crowdsourcing systems average estimates of a latent quantity of interest provided by many crowdworkers to produce a group estimate.
no code implementations • 2 Dec 2023 • Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy
Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA).
no code implementations • 11 Oct 2023 • Zheqing Zhu, Yueyang Liu, Xu Kuang, Benjamin Van Roy
Real-world applications of contextual bandits often exhibit non-stationarity due to seasonality, serendipity, and evolving social trends.
no code implementations • 23 Aug 2023 • Saurabh Kumar, Henrik Marklund, Benjamin Van Roy
In this paper, we propose L2 Init, a simple approach for maintaining plasticity by incorporating in the loss function L2 regularization toward initial parameters.
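As a concrete illustration, here is a minimal PyTorch sketch of the idea; the model, data, and the regularization weight `l2_init_lambda` are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of L2 Init: the usual task loss plus an L2 penalty that pulls
# parameters back toward their values at initialization, to preserve plasticity.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
init_params = [p.detach().clone() for p in model.parameters()]  # snapshot at init
l2_init_lambda = 1e-2  # illustrative regularization weight

def l2_init_loss(task_loss: torch.Tensor) -> torch.Tensor:
    reg = sum(((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), init_params))
    return task_loss + l2_init_lambda * reg

# Illustrative training step on random data.
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = l2_init_loss(F.mse_loss(model(x), y))
loss.backward()
```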
no code implementations • NeurIPS 2023 • David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, Satinder Singh
Using this new language, we define a continual learning agent as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents.
no code implementations • 20 Jul 2023 • David Abel, André Barreto, Hado van Hasselt, Benjamin Van Roy, Doina Precup, Satinder Singh
Standard models of the reinforcement learning problem give rise to a straightforward definition of convergence: An agent converges when its behavior or performance in each environment state stops changing.
no code implementations • 10 Jul 2023 • Saurabh Kumar, Henrik Marklund, Ashish Rao, Yifan Zhu, Hong Jun Jeon, Yueyang Liu, Benjamin Van Roy
The design of such agents, which remains a long-standing challenge of artificial intelligence, is addressed by the subject of continual learning.
no code implementations • 26 Jun 2023 • Zheqing Zhu, Benjamin Van Roy
In two distinct large-scale experiments with real-world tasks, ENR significantly boosts click-through rates and user ratings by at least 9% and 6% respectively compared to state-of-the-art neural contextual bandit algorithms.
no code implementations • 19 May 2023 • Wanqiao Xu, Shi Dong, Dilip Arumugam, Benjamin Van Roy
In this work, we adopt a novel perspective wherein a pre-trained language model is itself simultaneously a policy, reward function, and transition function.
no code implementations • 5 May 2023 • Dilip Arumugam, Mark K. Ho, Noah D. Goodman, Benjamin Van Roy
All biological and artificial agents must learn and make decisions given limits on their ability to process information.
no code implementations • 23 Feb 2023 • Yueyang Liu, Xu Kuang, Benjamin Van Roy
Despite the subject of non-stationary bandit learning having attracted much recent attention, we have yet to identify a formal definition of non-stationarity that can consistently distinguish non-stationary bandits from stationary ones.
1 code implementation • 18 Feb 2023 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy
Further, we demonstrate that the epinet -- a small additive network that estimates uncertainty -- matches the performance of large ensembles at orders of magnitude lower computational cost.
no code implementations • 7 Feb 2023 • Botao Hao, Rahul Jain, Tor Lattimore, Benjamin Van Roy, Zheng Wen
This offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level.
no code implementations • 24 Dec 2022 • Dilip Arumugam, Shi Dong, Benjamin Van Roy
Prevailing methods for assessing and comparing generative AIs incentivize responses that serve a hypothetical representative individual.
no code implementations • 2 Dec 2022 • Hong Jun Jeon, Benjamin Van Roy
For a particular learning model inspired by Barron (1993), we establish an upper bound on the minimal information-theoretically achievable expected error as a function of model and data set sizes.
no code implementations • 29 Nov 2022 • Wanqiao Xu, Shi Dong, Benjamin Van Roy
We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments.
1 code implementation • 3 Nov 2022 • Ian Osband, Seyed Mohammad Asghari, Benjamin Van Roy, Nat McAleese, John Aslanides, Geoffrey Irving
Language models often pre-train on large unsupervised text corpora, then fine-tune on additional task-specific data.
no code implementations • 30 Oct 2022 • Dilip Arumugam, Mark K. Ho, Noah D. Goodman, Benjamin Van Roy
Throughout the cognitive-science literature, there is widespread agreement that decision-making agents operating in the real world do so under limited information-processing capabilities and without access to unbounded cognitive or computational resources.
no code implementations • 18 Sep 2022 • Yifan Zhu, Hong Jun Jeon, Benjamin Van Roy
However, existing computational theory suggests that, even for single-hidden-layer teacher networks, the computation required to attain small error at this sample complexity is intractable.
no code implementations • 1 Jul 2022 • Xiuyuan Lu, Ian Osband, Seyed Mohammad Asghari, Sven Gowal, Vikranth Dwaracherla, Zheng Wen, Benjamin Van Roy
However, these improvements are relatively small compared to the outstanding issues in distributionally-robust deep learning.
no code implementations • 8 Jun 2022 • Vikranth Dwaracherla, Zheng Wen, Ian Osband, Xiuyuan Lu, Seyed Mohammad Asghari, Benjamin Van Roy
In machine learning, an agent needs to estimate uncertainty in order to explore efficiently, adapt, and make effective decisions.
no code implementations • 4 Jun 2022 • Dilip Arumugam, Benjamin Van Roy
To address this problem, we introduce an algorithm that, using rate-distortion theory, iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model.
no code implementations • 4 Jun 2022 • Dilip Arumugam, Benjamin Van Roy
The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment.
no code implementations • 4 May 2022 • Yueyang Liu, Xu Kuang, Benjamin Van Roy
We attribute such failures to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity.
no code implementations • 2 Mar 2022 • Chao Qin, Zheng Wen, Xiuyuan Lu, Benjamin Van Roy
Ensemble sampling serves as a practical approximation to Thompson sampling when maintaining an exact posterior distribution over model parameters is computationally intractable.
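A minimal sketch of the idea for a Gaussian K-armed bandit follows; the ensemble size, noise scale, and perturbation scheme are illustrative simplifications rather than the algorithm's exact specification. Each ensemble member is trained on its own randomly perturbed rewards, and at each step one member is sampled uniformly and its greedy action is played.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, T = 5, 30, 2000             # arms, ensemble members, time steps
sigma = 1.0                       # assumed reward-noise scale
true_means = rng.normal(0, 1, K)  # unknown to the agent

# Each member starts from an independent prior sample and keeps perturbed statistics.
est = rng.normal(0, 1, (M, K))    # per-member mean estimates
counts = np.ones((M, K))          # pseudo-counts (1 from the prior sample)

for t in range(T):
    m = rng.integers(M)                       # sample an ensemble member uniformly
    a = int(np.argmax(est[m]))                # act greedily w.r.t. that member
    r = true_means[a] + sigma * rng.normal()  # observe reward
    # Update every member on an independently perturbed copy of the observation.
    perturbed = r + sigma * rng.normal(size=M)
    est[:, a] = (est[:, a] * counts[:, a] + perturbed) / (counts[:, a] + 1)
    counts[:, a] += 1
```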
no code implementations • 1 Mar 2022 • Hong Jun Jeon, Yifan Zhu, Benjamin Van Roy
For a particular prior distribution on weights, we establish sample complexity bounds that are simultaneously width independent and linear in depth.
1 code implementation • 28 Feb 2022 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Xiuyuan Lu, Benjamin Van Roy
Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d.
no code implementations • 6 Jan 2022 • Yueyang Liu, Adithya M. Devraj, Benjamin Van Roy, Kuang Xu
We study the performance of an agent that attains a bounded information ratio with respect to a bandit environment with a Gaussian prior and a Gaussian likelihood when it is instead applied to a Bernoulli bandit.
no code implementations • NeurIPS 2021 • Dilip Arumugam, Benjamin Van Roy
All sequential decision-making agents explore so as to acquire knowledge about a particular target.
1 code implementation • 9 Oct 2021 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Botao Hao, Morteza Ibrahimi, Dieterich Lawson, Xiuyuan Lu, Brendan O'Donoghue, Benjamin Van Roy
Predictive distributions quantify uncertainties ignored by point estimates.
no code implementations • 29 Sep 2021 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Xiuyuan Lu, Morteza Ibrahimi, Vikranth Dwaracherla, Dieterich Lawson, Brendan O'Donoghue, Botao Hao, Benjamin Van Roy
This paper introduces The Neural Testbed, which provides tools for the systematic evaluation of agents that generate such predictions.
no code implementations • 26 Sep 2021 • Zheqing Zhu, Benjamin Van Roy
Where past work has aimed to learn from subsequent behavior, there has been a lack of effective methods for probing to elicit informative delayed feedback.
no code implementations • 20 Jul 2021 • Zheng Wen, Ian Osband, Chao Qin, Xiuyuan Lu, Morteza Ibrahimi, Vikranth Dwaracherla, Mohammad Asghari, Benjamin Van Roy
A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes?
1 code implementation • NeurIPS 2023 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy
We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty.
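A minimal sketch of that additive structure in PyTorch follows; the layer sizes, the epistemic-index dimension, and the stop-gradient on base features are illustrative assumptions, not the paper's exact architecture. Varying the epistemic index yields a range of plausible predictions for the same input, which is what expresses uncertainty.

```python
import torch
import torch.nn as nn

class EpinetSketch(nn.Module):
    # The base network produces features and a point prediction;
    # a small epinet adds an index-dependent correction to express uncertainty.
    def __init__(self, in_dim=10, feat_dim=32, out_dim=1, index_dim=8):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, out_dim)
        self.index_dim = index_dim
        self.epinet = nn.Sequential(
            nn.Linear(feat_dim + index_dim, 32), nn.ReLU(), nn.Linear(32, out_dim)
        )

    def forward(self, x, z):
        phi = self.features(x)
        base_out = self.head(phi)
        # Stop-gradient so the small epinet trains without perturbing base features.
        eps_in = torch.cat([phi.detach(), z.expand(x.shape[0], -1)], dim=-1)
        return base_out + self.epinet(eps_in)

# Different epistemic indices z yield different plausible predictions for the same input.
net = EpinetSketch()
x = torch.randn(4, 10)
preds = [net(x, torch.randn(1, net.index_dim)) for _ in range(3)]
```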
no code implementations • 6 Mar 2021 • Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen
To illustrate concepts, we design simple agents that build on them and present computational results that highlight data efficiency.
no code implementations • 18 Feb 2021 • Adithya M. Devraj, Benjamin Van Roy, Kuang Xu
The information ratio offers an approach to assessing the efficacy with which an agent balances between exploration and exploitation.
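Concretely, one standard form of the information ratio (as in the information-theoretic analyses of Russo and Van Roy) is the squared expected per-period regret divided by the information the period's observation reveals about the optimal action $A^*$: $\Gamma_t = \big(\mathbb{E}_t[R_{t,A^*} - R_{t,A_t}]\big)^2 / I_t\big(A^*; (A_t, Y_t)\big)$. Small values indicate that the agent incurs regret only when it stands to learn a commensurate amount about which action is optimal.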
no code implementations • 10 Feb 2021 • Shi Dong, Benjamin Van Roy, Zhengyuan Zhou
The time it takes to approach asymptotic performance is polynomial in the complexity of the agent's state representation and the time required to evaluate the best policy that the agent can represent.
no code implementations • 15 Jan 2021 • Dilip Arumugam, Benjamin Van Roy
Agents that learn to select optimal actions represent a prominent focus of the sequential decision-making literature.
no code implementations • NeurIPS 2020 • Zheng Wen, Doina Precup, Morteza Ibrahimi, Andre Barreto, Benjamin Van Roy, Satinder Singh
Hierarchical Reinforcement Learning (HRL) approaches promise to provide more efficient solutions to sequential decision-making problems, in terms of both statistical and computational efficiency.
no code implementations • 5 Oct 2020 • Dilip Arumugam, Benjamin Van Roy
State abstraction has been an essential tool for dramatically improving the sample efficiency of reinforcement-learning algorithms.
no code implementations • ICLR 2020 • Vikranth Dwaracherla, Xiuyuan Lu, Morteza Ibrahimi, Ian Osband, Zheng Wen, Benjamin Van Roy
This generalizes and extends the use of ensembles to approximate Thompson sampling.
2 code implementations • 17 Feb 2020 • Vikranth Dwaracherla, Benjamin Van Roy
Algorithms that tackle deep exploration -- an important challenge in reinforcement learning -- have relied on epistemic uncertainty representation through ensembles or other hypermodels, exploration bonuses, or visitation count distributions.
no code implementations • 13 Dec 2019 • Shi Dong, Benjamin Van Roy, Zhengyuan Zhou
We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + \epsilon HK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $\epsilon$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state.
no code implementations • NeurIPS 2019 • Xiuyuan Lu, Benjamin Van Roy
We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling.
no code implementations • 18 Nov 2019 • Benjamin Van Roy, Shi Dong
Du, Kakade, Wang, and Yang recently established intriguing lower bounds on sample complexity, which suggest that reinforcement learning with a misspecified representation is intractable.
3 code implementations • ICLR 2020 • Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, Hado van Hasselt
bsuite is a collection of carefully designed experiments that investigate core capabilities of reinforcement learning (RL) agents, with two objectives.
no code implementations • 12 May 2019 • Shi Dong, Tengyu Ma, Benjamin Van Roy
Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$.
no code implementations • NeurIPS 2018 • Shi Dong, Benjamin Van Roy
We also offer a bound for the logistic bandit that dramatically improves on the best previously available, though this bound depends on an information-theoretic statistic that we have only been able to quantify via computation.
1 code implementation • NeurIPS 2018 • Maria Dimakopoulou, Ian Osband, Benjamin Van Roy
We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale.
no code implementations • 7 Mar 2018 • Daniel Russo, Benjamin Van Roy
Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action.
no code implementations • ICML 2018 • Maria Dimakopoulou, Benjamin Van Roy
We consider a team of reinforcement learning agents that concurrently learn to operate in a common environment.
no code implementations • 29 Aug 2017 • Abbas Kazerouni, Benjamin Van Roy
As a firm varies the price of a product, consumers exhibit reference effects, making purchase decisions based not only on the prevailing price but also the product's price history.
2 code implementations • 7 Jul 2017 • Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen
Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance.
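As a concrete instance of this balance, here is a minimal sketch of Thompson sampling for a Bernoulli bandit with independent Beta priors; the arm probabilities and priors are illustrative. Each step samples one model of the world from the posterior and acts optimally under that sample, which naturally trades off exploitation against information gathering.

```python
import numpy as np

rng = np.random.default_rng(1)
true_probs = [0.3, 0.5, 0.7]           # unknown to the agent
alpha = np.ones(len(true_probs))       # Beta(1, 1) priors over each arm's success probability
beta = np.ones(len(true_probs))

for t in range(5000):
    theta = rng.beta(alpha, beta)      # sample one model of the world from the posterior
    a = int(np.argmax(theta))          # act optimally under the sampled model
    reward = rng.random() < true_probs[a]
    alpha[a] += reward                 # conjugate posterior update
    beta[a] += 1 - reward
```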
no code implementations • 13 Jun 2017 • Ian Osband, Benjamin Van Roy
We discuss the relative merits of optimistic and randomized approaches to exploration in reinforcement learning.
no code implementations • NeurIPS 2017 • Xiuyuan Lu, Benjamin Van Roy
Thompson sampling has emerged as an effective heuristic for a broad range of online decision problems.
no code implementations • 28 Apr 2017 • Daniel Russo, David Tse, Benjamin Van Roy
We propose satisficing Thompson sampling -- a variation of Thompson sampling -- and establish a strong discounted regret bound for this new algorithm.
no code implementations • 22 Mar 2017 • Ian Osband, Benjamin Van Roy, Daniel Russo, Zheng Wen
We study the use of randomized value functions to guide deep exploration in reinforcement learning.
no code implementations • 14 Feb 2017 • Ian Osband, Benjamin Van Roy
We consider the problem of sequential learning from categorical observations bounded in [0, 1].
no code implementations • NeurIPS 2017 • Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi-Yadkori, Benjamin Van Roy
We prove an upper bound on the regret of CLUCB and show that it can be decomposed into two terms: 1) an upper bound on the regret of the standard linear UCB algorithm that grows with the time horizon, and 2) a constant term, independent of the time horizon, that accounts for the loss incurred by being conservative in order to satisfy the safety constraint.
no code implementations • 9 Aug 2016 • Ian Osband, Benjamin Van Roy
This is a brief technical note to clarify the state of lower bounds on regret for reinforcement learning.
1 code implementation • 9 Aug 2016 • Ian Osband, Benjamin Van Roy
We review similar results for optimistic algorithms in infinite-horizon problems (Jaksch et al. 2010, Bartlett and Tewari 2009, Abbasi-Yadkori and Szepesvari 2011), with particular attention to dynamic episode growth.
no code implementations • ICML 2017 • Ian Osband, Benjamin Van Roy
Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms algorithms driven by optimism, such as UCRL2.
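For reference, a minimal sketch of the PSRL loop on a small tabular, finite-horizon MDP follows, assuming Dirichlet posteriors over transitions and known rewards for brevity; the toy environment and its sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, H, EPISODES = 5, 2, 10, 200

# A small random MDP used as the unknown environment (for the sketch only).
true_P = rng.dirichlet(np.ones(S), size=(S, A))   # true transition probabilities
true_R = rng.uniform(0, 1, size=(S, A))           # rewards, assumed known to the agent

dirichlet_counts = np.ones((S, A, S))             # Dirichlet(1, ..., 1) posterior over transitions

def plan(P, R):
    """Finite-horizon value iteration on a sampled MDP; returns a greedy policy per step."""
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                              # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

for episode in range(EPISODES):
    # Sample an MDP from the posterior, solve it, and follow its optimal policy for one episode.
    sampled_P = np.array([[rng.dirichlet(dirichlet_counts[s, a]) for a in range(A)]
                          for s in range(S)])
    policy = plan(sampled_P, true_R)
    s = 0
    for h in range(H):
        a = policy[h, s]
        s_next = rng.choice(S, p=true_P[s, a])
        dirichlet_counts[s, a, s_next] += 1        # posterior update from the observed transition
        s = s_next
```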
6 code implementations • NeurIPS 2016 • Ian Osband, Charles Blundell, Alexander Pritzel, Benjamin Van Roy
Efficient exploration in complex environments remains a major challenge for reinforcement learning.
no code implementations • 1 Jul 2015 • Ian Osband, Benjamin Van Roy
This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions.
no code implementations • NeurIPS 2014 • Ian Osband, Benjamin Van Roy
We consider the problem of learning to optimize an unknown Markov decision process (MDP).
no code implementations • 21 Mar 2014 • Daniel Russo, Benjamin Van Roy
We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback.
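The headline consequence of that analysis: if the information ratio is uniformly bounded by $\overline{\Gamma}$, then the Bayesian regret of Thompson sampling over $T$ periods satisfies $\mathbb{E}[\mathrm{Regret}(T)] \le \sqrt{\overline{\Gamma}\, H(A^*)\, T}$, where $H(A^*)$ is the entropy of the prior distribution of the optimal action. Bounds on $\overline{\Gamma}$ for specific problem classes (e.g., $d/2$ for $d$-dimensional linear bandits) then translate directly into regret bounds.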
no code implementations • NeurIPS 2014 • Daniel Russo, Benjamin Van Roy
We propose information-directed sampling -- a new approach to online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback.
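Concretely, in each period IDS chooses an action distribution $\pi_t$ minimizing the information ratio $\big(\sum_a \pi(a)\,\Delta_t(a)\big)^2 / \sum_a \pi(a)\, g_t(a)$, where $\Delta_t(a)$ is the expected single-period regret of action $a$ and $g_t(a)$ is the information that playing $a$ is expected to reveal about the optimal action $A^*$; Thompson sampling can be viewed as a heuristic that attains a bounded, though not minimal, value of this ratio.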
no code implementations • NeurIPS 2014 • Ian Osband, Benjamin Van Roy
Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces.
1 code implementation • 4 Feb 2014 • Ian Osband, Benjamin Van Roy, Zheng Wen
We propose randomized least-squares value iteration (RLSVI) -- a new reinforcement learning algorithm designed to explore and generalize efficiently via linearly parameterized value functions.
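A minimal sketch of the RLSVI recipe follows, using one-hot (tabular) features and a toy random MDP; the noise scale, prior parameter, and target-perturbation scheme are simplifying assumptions rather than the paper's exact specification. The key mechanism is that each episode fits value estimates by least squares on randomly perturbed targets, so acting greedily with respect to them induces deep exploration.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, H, EPISODES = 4, 2, 6, 300
sigma, lam = 1.0, 1.0                        # assumed noise scale and prior/ridge parameter
d = S * A                                    # one-hot (s, a) features for simplicity

def phi(s, a):
    x = np.zeros(d)
    x[s * A + a] = 1.0
    return x

# A small random MDP as the unknown environment (sketch only).
true_P = rng.dirichlet(np.ones(S), size=(S, A))
true_R = rng.uniform(0, 1, size=(S, A))

buffers = [[] for _ in range(H)]             # (s, a, r, s_next) tuples per time step

def rlsvi_weights():
    """Backward pass: fit randomly perturbed least-squares value estimates per step."""
    theta = [np.zeros(d) for _ in range(H + 1)]
    for h in reversed(range(H)):
        X = np.array([phi(s, a) for (s, a, r, sn) in buffers[h]]).reshape(-1, d)
        nxt = [0.0 if h + 1 == H else max(theta[h + 1] @ phi(sn, b) for b in range(A))
               for (s, a, r, sn) in buffers[h]]
        y = np.array([r for (s, a, r, sn) in buffers[h]]) + np.array(nxt)
        y = y + sigma * rng.normal(size=y.shape)               # perturb regression targets
        prior_draw = rng.normal(0, 1 / np.sqrt(lam), size=d)   # random prior perturbation
        Ainv = np.linalg.inv(X.T @ X / sigma**2 + lam * np.eye(d))
        theta[h] = Ainv @ (X.T @ y / sigma**2 + lam * prior_draw)
    return theta

for episode in range(EPISODES):
    theta = rlsvi_weights()
    s = 0
    for h in range(H):
        a = int(np.argmax([theta[h] @ phi(s, b) for b in range(A)]))
        r = true_R[s, a]
        s_next = int(rng.choice(S, p=true_P[s, a]))
        buffers[h].append((s, a, r, s_next))
        s = s_next
```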
no code implementations • NeurIPS 2013 • Zheng Wen, Benjamin Van Roy
We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization.
no code implementations • NeurIPS 2013 • Daniel Russo, Benjamin Van Roy
This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms.
no code implementations • 18 Jul 2013 • Zheng Wen, Benjamin Van Roy
We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization.
no code implementations • NeurIPS 2013 • Ian Osband, Daniel Russo, Benjamin Van Roy
This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.
no code implementations • NeurIPS 2012 • Morteza Ibrahimi, Adel Javanmard, Benjamin Van Roy
In particular, our algorithm has an average cost of $(1+\epsilon)$ times the optimum cost after $T = \mathrm{polylog}(p)\, O(1/\epsilon^2)$.
no code implementations • 11 Jan 2013 • Daniel Russo, Benjamin Van Roy
This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions such as in multi-armed bandit problems.
no code implementations • 26 Jul 2012 • Beomsoo Park, Benjamin Van Roy
The trader must learn coefficients of a price impact model while trading.