1 code implementation • 18 Feb 2023 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy
Further, we demonstrate that the \textit{epinet} -- a small additive network that estimates uncertainty -- matches the performance of large ensembles at orders of magnitude lower computational cost.
1 code implementation • 3 Nov 2022 • Ian Osband, Seyed Mohammad Asghari, Benjamin Van Roy, Nat McAleese, John Aslanides, Geoffrey Irving
Language models often pre-train on large unsupervised text corpora, then fine-tune on additional task-specific data.
no code implementations • 1 Jul 2022 • Xiuyuan Lu, Ian Osband, Seyed Mohammad Asghari, Sven Gowal, Vikranth Dwaracherla, Zheng Wen, Benjamin Van Roy
However, these improvements are relatively small compared to the outstanding issues in distributionally-robust deep learning.
no code implementations • 8 Jun 2022 • Vikranth Dwaracherla, Zheng Wen, Ian Osband, Xiuyuan Lu, Seyed Mohammad Asghari, Benjamin Van Roy
In machine learning, an agent needs to estimate uncertainty in order to explore and adapt efficiently and to make effective decisions.
1 code implementation • 28 Feb 2022 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Xiuyuan Lu, Benjamin Van Roy
Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d.
1 code implementation • 9 Oct 2021 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Botao Hao, Morteza Ibrahimi, Dieterich Lawson, Xiuyuan Lu, Brendan O'Donoghue, Benjamin Van Roy
Predictive distributions quantify uncertainties ignored by point estimates.
no code implementations • 29 Sep 2021 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Xiuyuan Lu, Morteza Ibrahimi, Vikranth Dwaracherla, Dieterich Lawson, Brendan O'Donoghue, Botao Hao, Benjamin Van Roy
This paper introduces \textit{The Neural Testbed}, which provides tools for the systematic evaluation of agents that generate such predictions.
no code implementations • 20 Jul 2021 • Zheng Wen, Ian Osband, Chao Qin, Xiuyuan Lu, Morteza Ibrahimi, Vikranth Dwaracherla, Mohammad Asghari, Benjamin Van Roy
A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes?
4 code implementations • 19 Jul 2021 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy
We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty.
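As a rough illustration of the idea, the sketch below adds a small index-conditioned network on top of a frozen base model and reads off uncertainty from the spread of its outputs across sampled indices; the array shapes, index dimension, and function names are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
INDEX_DIM = 8  # dimension of the epistemic index z (illustrative choice)

# Fixed toy "base network": a point prediction plus hidden features.
W_base = rng.normal(size=(4, 1))
def base_net(x):
    features = np.tanh(x)            # stand-in for hidden features
    prediction = features @ W_base   # point estimate, no uncertainty
    return prediction, features

# Small additive epinet: takes (features, index z) and returns a correction.
W_epi = rng.normal(scale=0.1, size=(4 + INDEX_DIM, 1))
def epinet(features, z):
    # In the paper the base features are treated as fixed inputs (stop-gradient);
    # everything is frozen in this toy, so that detail does not appear.
    return np.concatenate([features, z], axis=-1) @ W_epi

def predict(x, z):
    mu, features = base_net(x)
    return mu + epinet(features, z)  # output depends on the epistemic index

x = rng.normal(size=(1, 4))
samples = [predict(x, rng.normal(size=(1, INDEX_DIM))) for _ in range(100)]
print("mean", np.mean(samples), "spread", np.std(samples))  # spread ~ uncertainty
```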
no code implementations • 6 Mar 2021 • Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen
To illustrate concepts, we design simple agents that build on them and present computational results that highlight data efficiency.
no code implementations • ICLR 2020 • Vikranth Dwaracherla, Xiuyuan Lu, Morteza Ibrahimi, Ian Osband, Zheng Wen, Benjamin Van Roy
This generalizes and extends the use of ensembles to approximate Thompson sampling.
no code implementations • 9 Jun 2020 • Brendan O'Donoghue, Tor Lattimore, Ian Osband
We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback, where the players only observe each other's actions and a noisy payoff.
no code implementations • ICLR 2020 • Brendan O'Donoghue, Ian Osband, Catalin Ionescu
Reinforcement learning (RL) combines a control problem with statistical estimation: The system dynamics are not known to the agent, but can be learned through experience.
2 code implementations • ICLR 2020 • Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, Hado van Hasselt
bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives.
no code implementations • 8 May 2019 • Pedro A. Ortega, Jane. X. Wang, Mark Rowland, Tim Genewein, Zeb Kurth-Nelson, Razvan Pascanu, Nicolas Heess, Joel Veness, Alex Pritzel, Pablo Sprechmann, Siddhant M. Jayakumar, Tom McGrath, Kevin Miller, Mohammad Azar, Ian Osband, Neil Rabinowitz, András György, Silvia Chiappa, Simon Osindero, Yee Whye Teh, Hado van Hasselt, Nando de Freitas, Matthew Botvinick, Shane Legg
In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class.
1 code implementation • NeurIPS 2018 • Ian Osband, John Aslanides, Albin Cassirer
Dealing with uncertainty is essential for efficient reinforcement learning.
1 code implementation • NeurIPS 2018 • Maria Dimakopoulou, Ian Osband, Benjamin Van Roy
We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale.
1 code implementation • ICML 2018 • Brendan O'Donoghue, Ian Osband, Remi Munos, Volodymyr Mnih
In this paper we consider a similar \textit{uncertainty} Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps.
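A minimal tabular sketch of how such an uncertainty recursion can be computed by backward induction is given below; the transition model, local-uncertainty terms, and policy are made up for the demo, and the operator is a simplification of the one in the paper.

```python
import numpy as np

# Illustrative uncertainty Bellman recursion in a finite-horizon tabular MDP,
# assuming known transition probabilities P[s, a, s'] and per-step local
# uncertainty nu[s, a].
H, S, A = 5, 3, 2
gamma = 0.99
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, :] sums to 1
nu = rng.uniform(0.0, 1.0, size=(S, A))      # local uncertainty per (s, a)
policy = np.full((S, A), 1.0 / A)            # uniform policy for the demo

u = np.zeros((H + 1, S, A))
for t in reversed(range(H)):
    # Expected next-step uncertainty under the policy: E[u_{t+1}(s', a')].
    next_u = np.einsum('sap,pb,pb->sa', P, policy, u[t + 1])
    u[t] = nu + gamma ** 2 * next_u          # uncertainty propagates like value

print(u[0])  # uncertainty estimates at the first time-step, one per (s, a)
```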
2 code implementations • 7 Jul 2017 • Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen
Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance.
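For a concrete instance of the exploit/explore balance described above, here is a minimal Beta-Bernoulli Thompson sampling loop; the arm probabilities are invented for the demo, and the tutorial itself covers far more general problems.

```python
import numpy as np

rng = np.random.default_rng(2)
true_probs = np.array([0.3, 0.5, 0.7])   # unknown to the agent
alpha = np.ones(3)                        # Beta posterior: successes + 1
beta = np.ones(3)                         # Beta posterior: failures + 1

for t in range(1000):
    theta = rng.beta(alpha, beta)         # sample one plausible world
    arm = int(np.argmax(theta))           # act greedily on the sample
    reward = rng.random() < true_probs[arm]
    alpha[arm] += reward                  # the exploit/explore balance emerges
    beta[arm] += 1 - reward               # from posterior sampling, no tuning

print("posterior means:", alpha / (alpha + beta))
```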
15 code implementations • ICLR 2018 • Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg
We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration.
Ranked #1 on Atari Games on Atari 2600 Surround
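The core mechanism is a linear layer whose weights are perturbed by learned, parametric noise; a minimal forward-pass sketch is given below (numpy, training loop omitted, constants illustrative).

```python
import numpy as np

# NoisyNet-style linear layer: the weight is mu + sigma * eps, with eps
# resampled on each forward pass and (mu, sigma) learned by SGD in the real
# agent. Only the forward pass is shown here.
rng = np.random.default_rng(3)
in_dim, out_dim = 4, 2
mu_w = rng.normal(scale=1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))
sigma_w = np.full((in_dim, out_dim), 0.017)   # small initial noise scale
mu_b = np.zeros(out_dim)
sigma_b = np.full(out_dim, 0.017)

def noisy_linear(x):
    eps_w = rng.normal(size=mu_w.shape)       # independent noise; the paper also
    eps_b = rng.normal(size=mu_b.shape)       # describes a cheaper factorised variant
    return x @ (mu_w + sigma_w * eps_w) + (mu_b + sigma_b * eps_b)

x = rng.normal(size=(1, in_dim))
print(noisy_linear(x), noisy_linear(x))       # same input, stochastic outputs
```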
no code implementations • 13 Jun 2017 • Ian Osband, Benjamin Van Roy
We discuss the relative merits of optimistic and randomized approaches to exploration in reinforcement learning.
5 code implementations • 12 Apr 2017 • Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys
We present Deep Q-learning from Demonstrations (DQfD), an algorithm that leverages small sets of demonstration data to massively accelerate learning and that automatically assesses the necessary ratio of demonstration data during learning via a prioritized replay mechanism.
no code implementations • 22 Mar 2017 • Ian Osband, Benjamin Van Roy, Daniel Russo, Zheng Wen
We study the use of randomized value functions to guide deep exploration in reinforcement learning.
1 code implementation • ICML 2017 • Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs.
no code implementations • 14 Feb 2017 • Ian Osband, Benjamin Van Roy
We consider the problem of sequential learning from categorical observations bounded in [0, 1].
1 code implementation • 9 Aug 2016 • Ian Osband, Benjamin Van Roy
We review similar results for optimistic algorithms in infinite-horizon problems (Jaksch et al. 2010; Bartlett and Tewari 2009; Abbasi-Yadkori and Szepesvari 2011), with particular attention to dynamic episode growth.
no code implementations • 9 Aug 2016 • Ian Osband, Benjamin Van Roy
This is a brief technical note to clarify the state of lower bounds on regret for reinforcement learning.
no code implementations • ICML 2017 • Ian Osband, Benjamin Van Roy
Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms algorithms driven by optimism, such as UCRL2.
6 code implementations • NeurIPS 2016 • Ian Osband, Charles Blundell, Alexander Pritzel, Benjamin Van Roy
Efficient exploration in complex environments remains a major challenge for reinforcement learning.
Ranked #6 on Atari Games on Atari 2600 Breakout
no code implementations • 1 Jul 2015 • Ian Osband, Benjamin Van Roy
This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions.
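One such approach maintains an ensemble of estimates fit to bootstrapped data and samples a member uniformly to act; the sketch below shows this idea on a toy bandit, with all constants chosen for illustration rather than taken from the note.

```python
import numpy as np

# Bootstrapped-ensemble exploration on a Bernoulli bandit: each member keeps
# its own running estimate, sees each observation with probability 1/2 (a
# simple bootstrap mask), and one member is sampled uniformly to choose the
# action. No posterior is maintained or sampled explicitly.
rng = np.random.default_rng(4)
true_probs = np.array([0.3, 0.5, 0.7])
K, n_arms = 10, 3                        # K ensemble members
counts = np.ones((K, n_arms))            # per-member pseudo-counts
successes = rng.random((K, n_arms))      # random prior-like initial estimates

for t in range(1000):
    k = rng.integers(K)                              # sample an ensemble member
    arm = int(np.argmax(successes[k] / counts[k]))   # act greedily under it
    reward = float(rng.random() < true_probs[arm])
    mask = rng.random(K) < 0.5                       # each member sees ~half the data
    counts[mask, arm] += 1
    successes[mask, arm] += reward

print("ensemble mean estimates:", (successes / counts).mean(axis=0))
```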
no code implementations • NeurIPS 2014 • Ian Osband, Benjamin Van Roy
We consider the problem of learning to optimize an unknown Markov decision process (MDP).
Model-based Reinforcement Learning • reinforcement-learning • +1
no code implementations • NeurIPS 2014 • Ian Osband, Benjamin Van Roy
Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces.
1 code implementation • 4 Feb 2014 • Ian Osband, Benjamin Van Roy, Zheng Wen
We propose randomized least-squares value iteration (RLSVI) -- a new reinforcement learning algorithm designed to explore and generalize efficiently via linearly parameterized value functions.
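A single-stage sketch of the randomization idea is given below, assuming Bayesian linear regression with illustrative noise and prior scales: the agent samples value-function parameters from the fitted Gaussian rather than acting on the point estimate, which is what drives exploration.

```python
import numpy as np

# Single-stage sketch of randomized least-squares value estimation: fit a
# linearly parameterised value estimate by regularised least squares, then
# sample the parameters from the implied Gaussian. Dimensions, noise scale
# sigma, and prior scale lam are illustrative.
rng = np.random.default_rng(5)
n, d = 50, 3
sigma, lam = 1.0, 1.0
Phi = rng.normal(size=(n, d))                 # features of observed state-actions
targets = Phi @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=sigma, size=n)

precision = Phi.T @ Phi / sigma**2 + np.eye(d) / lam
cov = np.linalg.inv(precision)
mean = cov @ (Phi.T @ targets) / sigma**2
theta_sample = rng.multivariate_normal(mean, cov)   # randomized value function

print("point estimate:", mean)
print("sampled parameters used for acting:", theta_sample)
```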
no code implementations • NeurIPS 2013 • Ian Osband, Daniel Russo, Benjamin Van Roy
This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.