no code implementations • 6 Sep 2024 • Sergio Calvo-Ordoñez, Konstantina Palla, Kamil Ciosek
Recent work has shown that training wide neural networks with gradient descent is formally equivalent to computing the mean of the posterior distribution in a Gaussian Process (GP) with the Neural Tangent Kernel (NTK) as the prior covariance and zero aleatoric noise (Jacot et al., 2018).
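A rough sketch of the correspondence the excerpt refers to (notation assumed here rather than taken from the paper): for training inputs X with targets y and NTK Θ, the GP posterior mean at a test input x* with zero aleatoric noise is

```latex
\mu(x_*) \;=\; \Theta(x_*, X)\,\Theta(X, X)^{-1}\, y ,
```

which is the function the trained, infinitely wide network is stated to compute.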
no code implementations • 3 Apr 2024 • Nicolò Felicioni, Lucas Maystre, Sina Ghiassian, Kamil Ciosek
We compare this baseline to LLM bandits that make active use of uncertainty estimation by integrating the uncertainty in a Thompson Sampling policy.
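As a minimal illustration of a Thompson Sampling policy of the kind mentioned above (the Bernoulli-bandit setup, Beta priors, and all names below are assumptions made for the sketch, not details from the paper):

```python
import numpy as np

def thompson_sampling(n_arms, n_rounds, pull, seed=0):
    """Bernoulli Thompson Sampling with Beta(1, 1) priors.

    `pull(arm)` is assumed to return a reward in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_arms)  # posterior successes + 1
    beta = np.ones(n_arms)   # posterior failures + 1
    total = 0.0
    for _ in range(n_rounds):
        # Draw one plausible mean reward per arm from its posterior;
        # arms with wide (uncertain) posteriors get tried more often.
        theta = rng.beta(alpha, beta)
        arm = int(np.argmax(theta))
        reward = pull(arm)
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total += reward
    return total
```

The uncertainty estimate enters only through the posterior-sampling step, which is what separates a policy like this from a greedy baseline acting on point estimates alone.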
no code implementations • 13 Oct 2023 • Federico Tomasi, Joseph Cauteruccio, Surya Kanoria, Kamil Ciosek, Matteo Rinaldi, Zhenwen Dai
In this paper, we present a reinforcement learning framework that addresses these limitations by directly optimizing for user satisfaction metrics using a simulated playlist-generation environment.
1 code implementation • 19 Jul 2023 • Thomas M. McDonald, Lucas Maystre, Mounia Lalmas, Daniel Russo, Kamil Ciosek
In this context, we study a content exploration task, which we formalize as a multi-armed bandit problem with delayed rewards.
no code implementations • 6 Feb 2023 • Matthew Smith, Lucas Maystre, Zhenwen Dai, Kamil Ciosek
Imitation of expert behaviour is a highly desirable and safe approach to the problem of sequential decision making.
1 code implementation • ICLR 2022 • Kamil Ciosek
Imitation learning algorithms learn a policy from demonstrations of expert behavior.
1 code implementation • NeurIPS 2021 • David Lindner, Matteo Turchetta, Sebastian Tschiatschek, Kamil Ciosek, Andreas Krause
For many reinforcement learning (RL) applications, specifying a reward is difficult.
1 code implementation • 22 Jan 2021 • Tabish Rashid, Cheng Zhang, Kamil Ciosek
We show the benefits of using information gain as compared to the confidence interval criterion of ResponseGraphUCB (Rowland et al., 2019), and provide theoretical results justifying our method.
no code implementations • 18 Jan 2021 • Hisham Husain, Kamil Ciosek, Ryota Tomioka
Entropic regularization of policies in Reinforcement Learning (RL) is a commonly used heuristic to ensure that the learned policy explores the state-space sufficiently before overfitting to a locally optimal policy.
no code implementations • 14 Jan 2021 • Paul Knott, Micah Carroll, Sam Devlin, Kamil Ciosek, Katja Hofmann, A. D. Dragan, Rohin Shah
We apply this methodology to build a suite of unit tests for the Overcooked-AI environment, and use this test suite to evaluate three proposals for improving robustness.
no code implementations • 11 Jan 2021 • Luisa Zintgraf, Sam Devlin, Kamil Ciosek, Shimon Whiteson, Katja Hofmann
The optimal adaptive behaviour under uncertainty over the other agents' strategies w.r.t.
no code implementations • 16 Jul 2020 • Luke Harries, Rebekah Storan Clarke, Timothy Chapman, Swamy V. P. L. N. Nallamalli, Levent Ozgur, Shuktika Jain, Alex Leung, Steve Lim, Aaron Dietrich, José Miguel Hernández-Lobato, Tom Ellis, Cheng Zhang, Kamil Ciosek
Efficient software testing is essential for productive software development and reliable user experiences.
1 code implementation • ICML 2020 • Ron Amit, Ron Meir, Kamil Ciosek
Specifying a Reinforcement Learning (RL) task involves choosing a suitable planning horizon, which is typically modeled by a discount factor.
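A standard way to read the discount factor as a horizon (a general fact about discounting, not a result from the paper): a discount γ corresponds to an effective planning horizon of roughly

```latex
H_{\mathrm{eff}} \;\approx\; \frac{1}{1 - \gamma},
```

so, for example, γ = 0.99 weights future rewards as if planning about a hundred steps ahead.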
no code implementations • ICLR 2020 • Jacob Beck, Kamil Ciosek, Sam Devlin, Sebastian Tschiatschek, Cheng Zhang, Katja Hofmann
In many partially observable scenarios, Reinforcement Learning (RL) agents must rely on long-term memory in order to learn an optimal policy.
no code implementations • ICLR 2020 • Kamil Ciosek, Vincent Fortuin, Ryota Tomioka, Katja Hofmann, Richard Turner
Obtaining high-quality uncertainty estimates is essential for many applications of deep neural networks.
1 code implementation • NeurIPS 2019 • Kamil Ciosek, Quan Vuong, Robert Loftin, Katja Hofmann
To address both of these phenomena, we introduce a new algorithm, Optimistic Actor Critic, which approximates a lower and upper confidence bound on the state-action value function.
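One simple way to form such bounds from a pair of bootstrapped critics (a sketch under assumed names and scalings; the paper's exact construction may differ):

```python
import numpy as np

def confidence_bounds(q1, q2, beta_lb=1.0, beta_ub=4.0):
    """Lower and upper confidence bounds on Q(s, a) from two critics.

    q1, q2: arrays of Q-value estimates for the same state-action pairs.
    The disagreement between the critics serves as a crude proxy for
    epistemic uncertainty.
    """
    mean = (q1 + q2) / 2.0
    spread = np.abs(q1 - q2) / 2.0
    q_lb = mean - beta_lb * spread  # pessimistic: e.g. for conservative targets
    q_ub = mean + beta_ub * spread  # optimistic: e.g. to direct exploration
    return q_lb, q_ub
```

Roughly, the optimistic estimate can be used to steer exploration towards uncertain actions, while the pessimistic one keeps the learning targets conservative.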
1 code implementation • NeurIPS 2019 • Maximilian Igl, Kamil Ciosek, Yingzhen Li, Sebastian Tschiatschek, Cheng Zhang, Sam Devlin, Katja Hofmann
We discuss those differences and propose modifications to existing regularization techniques in order to better adapt them to RL.
no code implementations • 28 Oct 2019 • Kamil Ciosek, Quan Vuong, Robert Loftin, Katja Hofmann
To address both of these phenomena, we introduce a new algorithm, Optimistic Actor Critic, which approximates a lower and upper confidence bound on the state-action value function.
no code implementations • NeurIPS 2020 • Jiachen Li, Quan Vuong, Shuang Liu, Minghua Liu, Kamil Ciosek, Keith Ross, Henrik Iskov Christensen, Hao Su
To perform well, the policy must infer the task identity from collected transitions by modelling its dependency on states, actions and rewards.
no code implementations • 25 Sep 2019 • Quan Vuong, Shuang Liu, Minghua Liu, Kamil Ciosek, Hao Su, Henrik Iskov Christensen
Combining ideas from Batch RL and Meta RL, we propose tiMe, which distils multiple value functions and MDP embeddings using only existing data.
no code implementations • ICML 2018 • Matthew Fellows, Kamil Ciosek, Shimon Whiteson
We propose a new way of deriving policy gradient updates for reinforcement learning.
no code implementations • 10 Jan 2018 • Kamil Ciosek, Shimon Whiteson
For Gaussian policies, we introduce an exploration method that uses covariance proportional to the matrix exponential of the scaled Hessian of the critic with respect to the actions.
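A small sketch of what a covariance proportional to the matrix exponential of the scaled critic Hessian could look like (the scale constant, example values, and function names are assumptions for illustration, not the paper's exact recipe):

```python
import numpy as np
from scipy.linalg import expm

def exploration_covariance(hessian, scale=1.0):
    """Covariance proportional to exp(scale * H).

    `hessian` is the Hessian of the critic Q(s, a) with respect to the
    action a, a symmetric (d, d) array. Directions of positive curvature
    get exponentially more exploration noise; strongly negative-curvature
    directions get less. The matrix exponential of a symmetric matrix is
    positive definite, so the result is a valid covariance.
    """
    hessian = (hessian + hessian.T) / 2.0  # symmetrise defensively
    return expm(scale * hessian)

# Hypothetical usage: sample an exploratory action around the policy mean.
rng = np.random.default_rng(0)
H = np.array([[0.5, 0.1],
              [0.1, -2.0]])
cov = exploration_covariance(H, scale=1.0)
action = rng.multivariate_normal(mean=np.zeros(2), cov=cov)
```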
no code implementations • 15 Jun 2017 • Kamil Ciosek, Shimon Whiteson
We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning.
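The unification can be summarised through the general form of the policy gradient (notation assumed): EPG integrates the critic over the action distribution, analytically where possible, rather than relying on a single sampled action,

```latex
\nabla_\theta J(\theta)
  \;=\; \int_{\mathcal{S}} \rho^{\pi}(s)
        \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\,
        \hat{Q}(s, a)\; \mathrm{d}a \,\mathrm{d}s ,
```

with stochastic policy gradients estimating the inner integral by sampling and deterministic policy gradients arising as the limit of a policy collapsing to a point mass.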
no code implementations • 24 May 2016 • Supratik Paul, Konstantinos Chatzilygeroudis, Kamil Ciosek, Jean-Baptiste Mouret, Michael A. Osborne, Shimon Whiteson
ALOQ is robust to the presence of significant rare events, which may not be observable under random sampling but play a substantial role in determining the optimal policy.
no code implementations • 16 Jan 2015 • Kamil Ciosek, David Silver
This paper presents a way of solving Markov Decision Processes that combines state abstraction and temporal abstraction.
no code implementations • 22 Jan 2013 • Kamil Ciosek
This paper presents four different ways of looking at the well-known Least Squares Temporal Differences (LSTD) algorithm for computing the value function of a Markov Reward Process, each leading to different insights: the operator-theoretic approach via the Galerkin method, the statistical approach via instrumental variables, the linear dynamical system view, and the limit of the TD iteration.
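For reference, the object all four views characterise is the standard LSTD solution (notation assumed): with feature map φ, discount γ, and observed transitions (s_t, r_t, s_{t+1}), the weights θ of the linear value function solve Aθ = b, where

```latex
A \;=\; \sum_t \phi(s_t)\,\bigl(\phi(s_t) - \gamma\,\phi(s_{t+1})\bigr)^{\top},
\qquad
b \;=\; \sum_t \phi(s_t)\, r_t .
```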