1 code implementation • ICML 2020 • REDA ALAMI, Odalric-Ambrym Maillard, Raphaël Féraud
In this paper, we consider the problem of sequential change-point detection where both the change-points and the distributions before and after the change are assumed to be unknown.
no code implementations • 28 Sep 2023 • Shubhada Agrawal, Timothée Mathieu, Debabrota Basu, Odalric-Ambrym Maillard
In this setting, accommodating potentially unbounded corruptions, we establish a problem-dependent lower bound on regret for a given family of arm distributions.
no code implementations • 19 Sep 2023 • Tuan Dam, Pascal Stenger, Lukas Schneider, Joni Pajarinen, Carlo D'Eramo, Odalric-Ambrym Maillard
We introduce a novel backup operator that computes value nodes as the Wasserstein barycenter of their action-value children nodes; thus, propagating the uncertainty of the estimate across the tree to the root node.
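The barycentric backup is easiest to see in the Gaussian case. As a minimal sketch (not the authors' implementation), the Wasserstein-2 barycenter of one-dimensional Gaussian value estimates has a closed form: its mean and standard deviation are the weighted averages of the children's means and standard deviations.

```python
def gaussian_w2_barycenter(means, stds, weights):
    """Wasserstein-2 barycenter of 1-D Gaussians N(m_i, s_i^2).

    For Gaussians on the real line the barycenter is again Gaussian,
    with mean sum(w_i * m_i) and standard deviation sum(w_i * s_i).
    """
    total = sum(weights)
    w = [x / total for x in weights]
    m = sum(wi * mi for wi, mi in zip(w, means))
    s = sum(wi * si for wi, si in zip(w, stds))
    return m, s

# Backup step: a parent node aggregates its action-value children,
# propagating both the value estimate and its uncertainty up the tree.
m, s = gaussian_w2_barycenter(means=[1.0, 3.0], stds=[0.5, 1.5], weights=[1, 1])
```

Here the closed form is specific to the one-dimensional Gaussian case; the paper's operator applies to the value distributions maintained at the tree nodes.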
no code implementations • 19 Jun 2023 • Timothée Mathieu, Riccardo Della Vecchia, Alena Shilova, Matheus Medeiros Centa, Hector Kohler, Odalric-Ambrym Maillard, Philippe Preux
When comparing several RL algorithms, a major question is how many executions must be performed and how we can ensure that the results of such a comparison are theoretically sound.
no code implementations • 5 Oct 2022 • Reda Ouhamma, Debabrota Basu, Odalric-Ambrym Maillard
Our regret bound is order-optimal with respect to $H$ and $K$.
no code implementations • 15 Sep 2022 • Patrick Saux, Odalric-Ambrym Maillard
In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback signal.
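As a concrete instance of such sequential learning from feedback, here is a minimal UCB1 bandit loop; this is a classical baseline for illustration, not the algorithm of the paper, and the `pull` callback and arm means are hypothetical.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1: pull each arm once, then pick the arm maximizing
    empirical mean + exploration bonus sqrt(2 ln t / n_a)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1  # initialization: one pull per arm
        else:
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
    return counts

# Hypothetical Bernoulli arms with means 0.3 and 0.7: the better arm
# ends up pulled far more often.
random.seed(0)
counts = ucb1(lambda a: float(random.random() < (0.3, 0.7)[a]), 2, 500)
```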
1 code implementation • 24 Aug 2022 • Mahsa Asadi, Aurélien Bellet, Odalric-Ambrym Maillard, Marc Tommasi
We study the case where some of the distributions have the same mean, and the agents are allowed to actively query information from other agents.
no code implementations • 7 Jul 2022 • Romain Gautron, Emilio J. Padrón, Philippe Preux, Julien Bigot, Odalric-Ambrym Maillard, David Emukpere
gym-DSSAT is a gym interface to the Decision Support System for Agrotechnology Transfer (DSSAT), a high-fidelity crop simulator.
no code implementations • 7 Mar 2022 • Debabrota Basu, Odalric-Ambrym Maillard, Timothée Mathieu
We study the corrupted bandit problem, i.e., a stochastic multi-armed bandit problem with $k$ unknown reward distributions, which are heavy-tailed and corrupted by a history-independent adversary or Nature.
no code implementations • 18 Jan 2022 • Sayak Ray Chowdhury, Patrick Saux, Odalric-Ambrym Maillard, Aditya Gopalan
For the practitioner, we instantiate this novel bound to several classical families, e.g., Gaussian, Bernoulli, Exponential, Weibull, Pareto, Poisson and Chi-square, yielding explicit forms of the confidence sets and the Bregman information gain.
no code implementations • NeurIPS 2021 • Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard
We consider a multi-armed bandit problem specified by a one-dimensional exponential family of distributions endowed with a unimodal structure.
no code implementations • NeurIPS 2021 • Dorian Baudry, Patrick Saux, Odalric-Ambrym Maillard
The stochastic multi-armed bandit problem has been extensively studied under standard assumptions on the arms' distributions (e.g., bounded with known support, exponential family, etc.).
1 code implementation • NeurIPS 2021 • Fabien Pesquerel, Hassan Saber, Odalric-Ambrym Maillard
For this structured problem of practical relevance, we first derive the asymptotic regret lower bound and corresponding constrained optimization problem.
no code implementations • 18 Nov 2021 • Dorian Baudry, Patrick Saux, Odalric-Ambrym Maillard
The stochastic multi-armed bandit problem has been extensively studied under standard assumptions on the arms' distributions (e.g., bounded with known support, exponential family, etc.).
1 code implementation • NeurIPS 2020 • Dorian Baudry, Emilie Kaufmann, Odalric-Ambrym Maillard
In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli, Gaussian and Poisson distributions).
no code implementations • ICLR 2021 • Yannis Flet-Berliac, Reda Ouhamma, Odalric-Ambrym Maillard, Philippe Preux
We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms.
no code implementations • 9 Sep 2020 • Mohammad Sadegh Talebi, Anders Jonsson, Odalric-Ambrym Maillard
We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP).
no code implementations • 20 Jul 2020 • Edouard Leurent, Denis Efimov, Odalric-Ambrym Maillard
We consider the problem of stabilization of a linear system, under state and control constraints, and subject to bounded disturbances and unknown parameters in the state matrix.
no code implementations • 7 Jul 2020 • Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard
The problem instance is specified by a matrix of means in $[0, 1]^{\mathcal{A}\times\mathcal{B}}$ and by a given weight matrix $\omega$.
no code implementations • 30 Jun 2020 • Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard
This strategy is proven optimal.
no code implementations • ICML 2020 • Hippolyte Bourel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
In pursuit of practical efficiency, we present UCRL3, following the lines of UCRL2, but with two key modifications: First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and (component-wise) transition distributions for each state-action pair.
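To illustrate the flavour of such time-uniform concentration, here is a hedged sketch using the classical Laplace (mixture) bound for sub-Gaussian means; UCRL3's exact inequalities are refinements of this idea, not this formula.

```python
import math

def time_uniform_radius(n, delta, sigma=0.5):
    """Laplace-method time-uniform confidence radius for a sigma-sub-Gaussian
    mean: holds simultaneously for all sample sizes n >= 1 with probability
    at least 1 - delta.  (Illustrative; not UCRL3's exact inequality.)"""
    return sigma * math.sqrt(
        2 * (1 + 1 / n) * math.log(math.sqrt(n + 1) / delta) / n)

# Unlike a fixed-n Hoeffding bound, this radius can be evaluated at every
# visit count of a state-action pair without a union bound over time steps.
r = time_uniform_radius(n=100, delta=0.05)
```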
no code implementations • NeurIPS 2020 • Edouard Leurent, Denis Efimov, Odalric-Ambrym Maillard
We consider the problem of robust and adaptive model predictive control (MPC) of a linear system, with unknown parameters that are learned along the way (adaptive), in a critical setting where failures must be prevented (robust).
no code implementations • NeurIPS 2019 • Ronald Ortner, Matteo Pirotta, Alessandro Lazaric, Ronan Fruit, Odalric-Ambrym Maillard
We consider the problem of online reinforcement learning when several state representations (mapping histories to a discrete state space) are available to the learning agent.
no code implementations • 9 Oct 2019 • Mahsa Asadi, Mohammad Sadegh Talebi, Hippolyte Bourel, Odalric-Ambrym Maillard
In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.
no code implementations • 30 May 2019 • Subhojyoti Mukherjee, Odalric-Ambrym Maillard
The second strategy, ImpCPD, makes use of the knowledge of $T$ to achieve the order-optimal regret bound of $\min\big\lbrace O(\sum_{i=1}^{K} \sum_{g=1}^{G}\frac{\log(T/H_{1, g})}{\Delta^{opt}_{i, g}}), O(\sqrt{GT})\big\rbrace$ (where $H_{1, g}$ is the problem complexity), thereby closing an important gap with respect to the lower bound in a specific challenging setting.
no code implementations • NeurIPS 2019 • Mohammad Sadegh Talebi, Odalric-Ambrym Maillard
We study the problem of learning the transition matrices of a set of Markov chains from a single stream of observations on each chain.
no code implementations • 9 Apr 2019 • Edouard Leurent, Odalric-Ambrym Maillard
We consider the problem of online planning in a Markov Decision Process when given only access to a generative model, restricted to open-loop policies, i.e., sequences of actions, and under a budget constraint.
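A generic open-loop planner with a generative model can be sketched as follows: score fixed action sequences (no state feedback inside a sequence) under a call budget, by random shooting. This is an illustrative baseline under assumed `step(state, action) -> (next_state, reward)` semantics, not the paper's algorithm.

```python
import random

def open_loop_plan(step, state, actions, depth, budget, gamma=0.95):
    """Open-loop planning sketch: sample fixed action sequences, evaluate
    each with the generative model under a total call budget, and return
    the first action of the best-scoring sequence."""
    best_seq, best_ret = None, float("-inf")
    calls = 0
    while calls + depth <= budget:
        seq = [random.choice(actions) for _ in range(depth)]
        s, ret = state, 0.0
        for i, a in enumerate(seq):
            s, r = step(s, a)          # one generative-model call
            ret += (gamma ** i) * r
            calls += 1
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq[0] if best_seq else actions[0]

# Hypothetical chain MDP: moving right from state 0 reaches the reward
# at state 3; the planner should pick the +1 action first.
random.seed(2)
a0 = open_loop_plan(lambda s, a: (s + a, 1.0 if s + a == 3 else 0.0),
                    state=0, actions=[-1, 1], depth=3, budget=300)
```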
1 code implementation • NeurIPS 2019 • Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, Olivier Pietquin
A Budgeted Markov Decision Process (BMDP) is an extension of a Markov Decision Process to critical applications requiring safety constraints.
no code implementations • 1 Mar 2019 • Edouard Leurent, Yann Blanco, Denis Efimov, Odalric-Ambrym Maillard
This work studies the design of safe control policies for large-scale non-linear systems operating in uncertain environments.
no code implementations • 5 Feb 2019 • Lilian Besson, Emilie Kaufmann, Odalric-Ambrym Maillard, Julien Seznec
We introduce GLR-klUCB, a novel algorithm for the piecewise iid non-stationary bandit problem with bounded rewards.
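The GLR (generalized likelihood ratio) component can be sketched as follows: for every split point of the observed stream, compare the two-segment fit against the single-segment one via Bernoulli KL divergences, and declare a change when the statistic crosses a threshold. The threshold here is an arbitrary illustrative value, whereas the paper calibrates it from theory.

```python
import math

def kl_bern(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped for safety."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def glr_change_detected(x, threshold):
    """Bernoulli GLR test: scan every split point s of the stream x and
    compare the two-segment likelihood against the single-segment one."""
    n = len(x)
    total = sum(x)
    m = total / n
    prefix = 0
    for s in range(1, n):
        prefix += x[s - 1]
        m1 = prefix / s                  # mean before the candidate change
        m2 = (total - prefix) / (n - s)  # mean after the candidate change
        if s * kl_bern(m1, m) + (n - s) * kl_bern(m2, m) > threshold:
            return True
    return False

# A stream whose mean jumps from ~0 to ~1 triggers the detector.
detected = glr_change_detected([0] * 20 + [1] * 20, threshold=3.0)
```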
no code implementations • 5 Mar 2018 • Mohammad Sadegh Talebi, Odalric-Ambrym Maillard
The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered when the learner interacts with the system in a single stream of observations, starting from an initial state and without any reset.
no code implementations • 31 Aug 2017 • Jaouad Mourtada, Odalric-Ambrym Maillard
By contrast, designing strategies that both achieve a near-optimal regret and maintain a reasonable number of weights is highly non-trivial.
no code implementations • 2 Aug 2017 • Audrey Durand, Odalric-Ambrym Maillard, Joelle Pineau
The variance of the noise is not assumed to be known.
no code implementations • ICML 2017 • Borja Balle, Odalric-Ambrym Maillard
We present spectral methods of moments for learning sequential models from a single trajectory, in stark contrast with the classical literature that assumes the availability of multiple i.i.d. trajectories.
no code implementations • 24 May 2017 • Odalric-Ambrym Maillard
We consider parametric exponential families of dimension $K$ on the real line.
no code implementations • 7 Sep 2016 • Robin Allesiardo, Raphaël Féraud, Odalric-Ambrym Maillard
For the best-arm identification task, we introduce a version of Successive Elimination based on random shuffling of the $K$ arms.
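A minimal sketch of Successive Elimination with per-round shuffling follows; the confidence radius is a standard anytime choice picked for illustration, not the paper's exact tuning, and the `pull` callback and arm means are hypothetical.

```python
import math
import random

def successive_elimination(pull, n_arms, delta, max_rounds=10_000):
    """Successive Elimination sketch: each round, sample every active arm in
    a freshly shuffled order, then drop arms that are confidently worse."""
    active = list(range(n_arms))
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, max_rounds + 1):
        random.shuffle(active)           # random shuffling of the arms
        for a in active:
            sums[a] += pull(a)
            counts[a] += 1
        # Anytime confidence radius (illustrative choice).
        rad = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
        best_lcb = max(sums[a] / counts[a] - rad for a in active)
        active = [a for a in active if sums[a] / counts[a] + rad >= best_lcb]
        if len(active) == 1:
            return active[0]
    return max(active, key=lambda a: sums[a] / counts[a])

# Hypothetical Bernoulli arms with means 0.2, 0.5, 0.8.
random.seed(1)
best = successive_elimination(
    lambda a: float(random.random() < (0.2, 0.5, 0.8)[a]), 3, delta=0.05)
```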
no code implementations • 6 Sep 2016 • Aditya Gopalan, Odalric-Ambrym Maillard, Mohammadi Zaki
This induces a low-rank structure on the matrix of expected rewards $r_{a, b}$ from recommending item $a$ to user $b$.
no code implementations • NeurIPS 2014 • Odalric-Ambrym Maillard, Timothy A. Mann, Shie Mannor
In Reinforcement Learning (RL), state-of-the-art algorithms require a large number of samples per state-action pair to estimate the transition kernel $p$.
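The estimate in question is the per-state-action empirical (maximum-likelihood) transition kernel, whose per-pair sample hunger is what the abstract points to; a minimal sketch:

```python
from collections import defaultdict

def estimate_kernel(transitions):
    """Maximum-likelihood estimate of the transition kernel p(s'|s, a)
    from a list of observed (s, a, s') triples: normalized visit counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    kernel = {}
    for sa, nxt in counts.items():
        n = sum(nxt.values())            # visits to this state-action pair
        kernel[sa] = {s2: c / n for s2, c in nxt.items()}
    return kernel

# Hypothetical transitions: pair (0, "a") was visited three times.
data = [(0, "a", 1), (0, "a", 1), (0, "a", 0), (1, "a", 0)]
p = estimate_kernel(data)
```

Each probability is estimated only from visits to its own state-action pair, which is why accurate estimation requires many samples per pair.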
no code implementations • 12 May 2014 • Ronald Ortner, Odalric-Ambrym Maillard, Daniil Ryabko
We consider a reinforcement learning setting introduced in (Maillard et al., NIPS 2011) where the learner does not have explicit access to the states of the underlying Markov decision process (MDP).
no code implementations • NeurIPS 2012 • Alexandra Carpentier, Odalric-Ambrym Maillard
We here consider an extension of this problem to the case when the arms are the cells of a finite partition $\mathcal{P}$ of a continuous sampling space $\mathcal{X} \subset \mathbb{R}^d$.
no code implementations • NeurIPS 2012 • Odalric-Ambrym Maillard
This paper aims to take a step towards making the term "intrinsic motivation" from reinforcement learning theoretically well founded, focusing on curiosity-driven learning.
no code implementations • NeurIPS 2011 • Odalric-Ambrym Maillard, Daniil Ryabko, Rémi Munos
Knowing neither which of the models is the correct one, nor the probabilistic characteristics of the resulting MDP, the learner is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several).
no code implementations • NeurIPS 2011 • Alexandra Carpentier, Odalric-Ambrym Maillard, Rémi Munos
We consider the problem of recovering the parameter $\alpha \in \mathbb{R}^K$ of a sparse function $f$, i.e., the number of non-zero entries of $\alpha$ is small compared to the number $K$ of features, given noisy evaluations of $f$ at a set of well-chosen sampling points.