no code implementations • 1 Sep 2023 • Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko
We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback.
1 code implementation • 22 May 2023 • Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard, Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári, Wataru Kumagai, Yutaka Matsuo
Mirror descent value iteration (MDVI), an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL), has served as the basis for recent high-performing practical RL algorithms.
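As a rough illustration of that abstraction, here is a minimal tabular sketch of one KL- and entropy-regularized value-iteration step in the spirit of MDVI. The parameter names (`alpha`, `tau`) and the specific weighting of the KL and entropy terms are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mdvi_step(Q, pi_prev, r, P, gamma=0.99, alpha=0.9, tau=0.1):
    """One tabular KL/entropy-regularized value-iteration step (illustrative).

    Q:       (S, A) action-value table
    pi_prev: (S, A) previous policy, used as the KL anchor
    r:       (S, A) rewards
    P:       (S, A, S) transition kernel
    alpha:   weight of the KL term relative to the entropy term (assumed)
    tau:     regularization temperature (assumed)
    """
    # Mirror-descent policy improvement: the regularized greedy policy is a
    # softmax of Q tilted towards the previous policy, pi ~ pi_prev^alpha * exp(Q / tau).
    logits = alpha * np.log(pi_prev + 1e-12) + Q / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)

    # Regularized state value: expected Q with a KL penalty and an entropy bonus.
    kl = np.sum(pi * (np.log(pi + 1e-12) - np.log(pi_prev + 1e-12)), axis=1)
    ent = -np.sum(pi * np.log(pi + 1e-12), axis=1)
    V = np.sum(pi * Q, axis=1) - tau * (alpha * kl - (1 - alpha) * ent)

    # Evaluation step: one application of the Bellman operator.
    Q_next = r + gamma * (P @ V)
    return Q_next, pi

# Tiny demo on a random 2-state, 2-action MDP (illustrative only).
S, A = 2, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # (S, A, S) transition kernel
r = rng.random((S, A))
Q = np.zeros((S, A))
pi = np.full((S, A), 1.0 / A)
for _ in range(50):
    Q, pi = mdvi_step(Q, pi, r, P)
```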
no code implementations • 26 Mar 2023 • Mariana Vargas Vieyra, Pierre Ménard
We present a novel, alternative framework for learning generative models with goal-conditioned reinforcement learning.
1 code implementation • 23 Dec 2022 • Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko
Imperfect information games (IIG) are games in which each player only partially observes the current game state.
no code implementations • 27 May 2022 • Tadashi Kozuno, Wenhao Yang, Nino Vieillard, Toshinori Kitamura, Yunhao Tang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Michal Valko, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári
In this work, we consider and analyze the sample complexity of model-free reinforcement learning with a generative model.
no code implementations • NeurIPS 2021 • Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard
We consider a multi-armed bandit problem specified by a set of one-dimensional exponential family distributions endowed with a unimodal structure.
no code implementations • NeurIPS 2021 • Tadashi Kozuno, Pierre Ménard, Rémi Munos, Michal Valko
We study the problem of learning a Nash equilibrium (NE) in an extensive game with imperfect information (EGII) through self-play.
no code implementations • 23 Nov 2021 • Jean Tarbouriech, Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Michal Valko, Alessandro Lazaric
We introduce a generic strategy for provably efficient multi-goal exploration.
no code implementations • 18 Jun 2021 • James Cheshire, Pierre Ménard, Alexandra Carpentier
Taking $K$ as the number of arms, we consider (i) the case where the sequence of arm means $(\mu_k)_{k=1}^K$ is monotonically increasing (MTBP) and (ii) the case where it is concave (CTBP).
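For intuition, the toy instance below (hypothetical means `mu`, threshold `tau`, and a naive uniform-allocation baseline) illustrates the monotone thresholding bandit setup; structured algorithms for MTBP/CTBP would instead exploit the monotone or concave shape of the means, for example by searching for the crossing point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy MTBP instance: monotonically increasing means and a threshold tau.
mu = np.array([0.1, 0.2, 0.35, 0.55, 0.7, 0.9])   # (mu_k) is increasing
tau = 0.5                                          # the last three arms are above tau

def uniform_allocation_estimate(mu, tau, budget):
    """Naive baseline: spread the budget evenly over the arms, then threshold
    the empirical means. Purely illustrative; not the paper's algorithm."""
    K = len(mu)
    pulls = budget // K
    means_hat = rng.binomial(pulls, mu) / pulls     # Bernoulli rewards
    return means_hat >= tau                          # estimated set of arms above tau

print(uniform_allocation_estimate(mu, tau, budget=600))
```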
no code implementations • 11 Jun 2021 • Tadashi Kozuno, Pierre Ménard, Rémi Munos, Michal Valko
We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play.
no code implementations • NeurIPS 2021 • Rianne de Heide, James Cheshire, Pierre Ménard, Alexandra Carpentier
We characterize the optimal learning rates in both the cumulative regret setting and the best-arm identification setting, in terms of the problem parameters $T$ (the budget), $p^*$ and $\Delta$.
no code implementations • 7 Oct 2020 • Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, Michal Valko
In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode.
no code implementations • 27 Jul 2020 • Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, Michal Valko
Realistic environments often provide agents with very limited feedback.
no code implementations • 9 Jul 2020 • Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko
In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric.
no code implementations • 7 Jul 2020 • Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard
We study a structured variant of the multi-armed bandit problem, specified by a set of Bernoulli distributions with means $(\mu_{a,b})_{a\in\mathcal{A}, b\in\mathcal{B}} \in [0, 1]^{\mathcal{A}\times\mathcal{B}}$ and by a given weight matrix $\omega$.
no code implementations • ICML 2020 • Rémy Degenne, Pierre Ménard, Xuedong Shang, Michal Valko
We investigate an active pure-exploration setting, which includes best-arm identification, in the context of linear stochastic bandits.
no code implementations • 30 Jun 2020 • Hassan Saber, Pierre Ménard, Odalric-Ambrym Maillard
The proposed strategy is proven to be optimal.
no code implementations • 11 Jun 2020 • Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko
Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel.
no code implementations • NeurIPS 2020 • Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, Michal Valko
We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process in which transitions have a finite support.
1 code implementation • 12 Apr 2020 • Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric.
no code implementations • 24 Oct 2019 • Xuedong Shang, Rianne de Heide, Emilie Kaufmann, Pierre Ménard, Michal Valko
We investigate and provide new insights on the sampling rule called Top-Two Thompson Sampling (TTTS).
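As a reference point, here is a minimal sketch of the Top-Two Thompson Sampling rule for Bernoulli arms with Beta(1, 1) priors; the resampling cap and the default `beta = 0.5` are implementation conveniences, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def ttts_choose_arm(successes, failures, beta=0.5, max_resample=100):
    """One round of the Top-Two Thompson Sampling rule for Bernoulli arms.

    successes, failures: per-arm counts (Beta(1, 1) priors assumed here).
    beta: probability of playing the Thompson sample's leader rather than
          the challenger (0.5 is a common default).
    """
    # First posterior sample: the "leader" is the best arm under this sample.
    theta = rng.beta(successes + 1, failures + 1)
    leader = int(np.argmax(theta))
    if rng.random() < beta:
        return leader
    # Otherwise resample until a different arm (the "challenger") comes out on top.
    for _ in range(max_resample):
        theta = rng.beta(successes + 1, failures + 1)
        challenger = int(np.argmax(theta))
        if challenger != leader:
            return challenger
    return leader  # fallback if the posteriors are too concentrated

# Example: 3 Bernoulli arms with a few observed counts.
print(ttts_choose_arm(np.array([5, 2, 0]), np.array([3, 4, 1])))
```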
no code implementations • NeurIPS 2019 • Rémy Degenne, Wouter M. Koolen, Pierre Ménard
Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment.
no code implementations • 20 May 2019 • Pierre Ménard
We present a new algorithm based on gradient ascent for a general active exploration bandit problem in the fixed confidence setting.
no code implementations • 13 Nov 2017 • Aurélien Garivier, Pierre Ménard, Laurent Rossi
We analyze the sample complexity of the thresholding bandit problem, with and without the assumption that the mean values of the arms are increasing.
no code implementations • 23 Feb 2017 • Pierre Ménard, Aurélien Garivier
We propose the kl-UCB++ algorithm for regret minimization in stochastic bandit models with exponential families of distributions.
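For context, the sketch below computes a kl-UCB-style index for a Bernoulli arm by bisection; kl-UCB++ uses a specific horizon-dependent exploration function, which is abstracted here into a generic `exploration` argument.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, exploration, tol=1e-6):
    """Largest q in [mean, 1] with pulls * kl(mean, q) <= exploration.

    Solved by bisection, since q -> kl(mean, q) is increasing on [mean, 1].
    """
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mean, mid) <= exploration:
            lo = mid
        else:
            hi = mid
    return lo

# Example: empirical mean 0.4 after 20 pulls, with a generic exploration level.
print(kl_ucb_index(0.4, 20, np.log(100)))
```

In a kl-UCB++-style algorithm, `exploration` would depend on the horizon $T$, the number of arms $K$ and the arm's pull count (roughly $\log_+(T / (K \cdot \text{pulls}))$ up to lower-order corrections); the exact exploration function should be taken from the paper itself.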
no code implementations • 23 Feb 2016 • Aurélien Garivier, Pierre Ménard, Gilles Stoltz
We revisit lower bounds on the regret in the case of multi-armed bandit problems.