no code implementations • 19 Mar 2024 • Edoardo Cetin, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric, Yann Ollivier, Ahmed Touati
Offline reinforcement learning algorithms have proven effective on datasets highly connected to the target downstream task.
no code implementations • 7 Feb 2023 • Liyu Chen, Andrea Tirinzoni, Alessandro Lazaric, Matteo Pirotta
We leverage these results to design Layered Autonomous Exploration (LAE), a novel algorithm for AX that attains a sample complexity of $\tilde{\mathcal{O}}(LS^{\rightarrow}_{L(1+\epsilon)}\Gamma_{L(1+\epsilon)} A \ln^{12}(S^{\rightarrow}_{L(1+\epsilon)})/\epsilon^2)$, where $S^{\rightarrow}_{L(1+\epsilon)}$ is the number of states that are incrementally $L(1+\epsilon)$-controllable, $A$ is the number of actions, and $\Gamma_{L(1+\epsilon)}$ is the branching factor of the transitions over such states.
no code implementations • 19 Dec 2022 • Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
In contextual linear bandits, the reward function is assumed to be a linear combination of an unknown reward vector and a given embedding of context-arm pairs.
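For illustration, a minimal sketch of this linear reward model (the embedding, dimensions, and noise level below are hypothetical, not taken from the paper):

```python
import numpy as np

# Hypothetical embedding of a (context, arm) pair; the dimension d is arbitrary.
def phi(context: np.ndarray, arm: np.ndarray) -> np.ndarray:
    return np.outer(context, arm).ravel()

rng = np.random.default_rng(0)
d = 6
theta_star = rng.normal(size=d)                      # unknown reward vector
context, arm = rng.normal(size=2), rng.normal(size=3)

# Expected reward is linear in the embedding: r(x, a) = <theta*, phi(x, a)>
expected_reward = phi(context, arm) @ theta_star
reward = expected_reward + rng.normal(scale=0.1)     # noisy observation
```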
no code implementations • 4 Nov 2022 • Yifang Chen, Karthik Sankararaman, Alessandro Lazaric, Matteo Pirotta, Dmytro Karamshuk, Qifan Wang, Karishma Mandyam, Sinong Wang, Han Fang
We design a novel algorithmic template, Weak Labeler Active Cover (WL-AC), that is able to robustly leverage the lower quality weak labelers to reduce the query complexity while retaining the desired level of accuracy.
no code implementations • 24 Oct 2022 • Andrea Tirinzoni, Matteo Papini, Ahmed Touati, Alessandro Lazaric, Matteo Pirotta
We study the problem of representation learning in stochastic contextual linear bandits.
no code implementations • 18 Oct 2022 • Virginie Do, Elvis Dohmatob, Matteo Pirotta, Alessandro Lazaric, Nicolas Usunier
We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context.
no code implementations • 10 Oct 2022 • Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
We also initiate the study of learning $\epsilon$-optimal policies without access to a generative model (i.e., the so-called best-policy identification problem), and show that sample-efficient learning is impossible in general.
no code implementations • 13 Dec 2021 • Evrard Garcelon, Vashist Avadhanula, Alessandro Lazaric, Matteo Pirotta
We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased evaluations of the true reward of each arm, and selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds.
no code implementations • 11 Dec 2021 • Evrard Garcelon, Kamalika Chaudhuri, Vianney Perchet, Matteo Pirotta
Contextual bandit algorithms are widely used in domains where it is desirable to provide a personalized service by leveraging contextual information, that may contain sensitive information that needs to be protected.
no code implementations • 2 Dec 2021 • Paul Luyo, Evrard Garcelon, Alessandro Lazaric, Matteo Pirotta
We first consider the setting of linear-mixture MDPs (Ayoub et al., 2020), a.k.a. the model-based setting, and provide a unified framework for analyzing joint and local differentially private (DP) exploration.
no code implementations • 23 Nov 2021 • Jean Tarbouriech, Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Michal Valko, Alessandro Lazaric
We introduce a generic strategy for provably efficient multi-goal exploration.
no code implementations • NeurIPS 2021 • Matteo Papini, Andrea Tirinzoni, Aldo Pacchiano, Marcello Restelli, Alessandro Lazaric, Matteo Pirotta
We study the role of the representation of state-action value functions in regret minimization in finite-horizon Markov Decision Processes (MDPs) with linear structure.
no code implementations • 24 Jun 2021 • Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
We derive a novel asymptotic problem-dependent lower-bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs).
no code implementations • ICLR 2022 • Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, LiWei Wang, Simon S. Du
We also obtain a new upper bound for conservative low-rank MDP.
no code implementations • NeurIPS 2021 • Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
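As a concrete illustration of the SSP objective, here is a toy value-iteration sketch on a made-up three-state instance (the transition matrix, costs, and goal state are invented for illustration only):

```python
import numpy as np

# Toy SSP instance: states 0..2, state 2 is the (absorbing, zero-cost) goal.
# P[s, a, s'] are transition probabilities, c[s, a] are per-step costs.
n_states, n_actions, goal = 3, 2, 2
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.3, 0.3, 0.4], [0.0, 0.2, 0.8]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
c = np.array([[1.0, 2.0], [1.0, 0.5], [0.0, 0.0]])

# Value iteration for the SSP objective:
# V(s) = min_a c(s, a) + sum_s' P(s, a, s') V(s')
V = np.zeros(n_states)
for _ in range(1000):
    Q = c + P @ V              # shape (n_states, n_actions)
    V_new = Q.min(axis=1)
    V_new[goal] = 0.0          # no cost accumulates once the goal is reached
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
```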
no code implementations • 8 Apr 2021 • Matteo Papini, Andrea Tirinzoni, Marcello Restelli, Alessandro Lazaric, Matteo Pirotta
We show that the regret is indeed never worse than the regret obtained by running LinUCB on the best representation (up to a $\ln M$ factor).
no code implementations • 17 Mar 2021 • Evrard Garcelon, Vianney Perchet, Matteo Pirotta
A critical aspect of bandit methods is that they require observing the contexts (i.e., individual or group-level data) and rewards in order to solve the sequential problem.
no code implementations • 1 Jan 2021 • Pierre-Alexandre Kamienny, Matteo Pirotta, Alessandro Lazaric, Thibault Lavril, Nicolas Usunier, Ludovic Denoyer
Meta-reinforcement learning aims at finding a policy able to generalize to new environments.
no code implementations • NeurIPS 2020 • Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
We investigate the exploration of an unknown environment when no reward function is provided.
no code implementations • NeurIPS 2020 • Andrea Tirinzoni, Matteo Pirotta, Marcello Restelli, Alessandro Lazaric
Finally, we remove forced exploration and build on confidence intervals of the optimization problem to encourage a minimum level of exploration that is better adapted to the problem structure.
no code implementations • NeurIPS 2021 • Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta
Motivated by this, we study privacy in the context of finite-horizon Markov Decision Processes (MDPs) by requiring information to be obfuscated on the user side.
no code implementations • NeurIPS 2021 • Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior.
no code implementations • 10 Jul 2020 • Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
We consider the problem of exploration-exploitation in communicating Markov Decision Processes.
no code implementations • 9 Jul 2020 • Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko
In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric.
1 code implementation • 6 May 2020 • Pierre-Alexandre Kamienny, Matteo Pirotta, Alessandro Lazaric, Thibault Lavril, Nicolas Usunier, Ludovic Denoyer
We test the performance of our algorithm in a variety of environments where tasks may vary within each episode.
1 code implementation • 12 Apr 2020 • Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric.
no code implementations • 6 Mar 2020 • Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, Alessandro Lazaric
Using a number of simple domains with heterogeneous noise in their transitions, we show that our heuristic-based algorithm outperforms both our original algorithm and the maximum entropy algorithm in the small-sample regime, while achieving asymptotic performance similar to that of the original algorithm.
no code implementations • 4 Mar 2020 • Yonathan Efroni, Shie Mannor, Matteo Pirotta
In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities.
no code implementations • NeurIPS 2020 • Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, Matteo Pirotta
In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior.
no code implementations • 8 Feb 2020 • Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta
In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better or optimal policy, under the constraint that the performance during learning is almost never worse than that of the baseline itself.
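A minimal sketch of the kind of conservative check this constraint suggests; the condition, names, and threshold below are hypothetical and not the paper's exact rule:

```python
# Only play the exploratory arm if a pessimistic (lower-confidence) estimate of
# cumulative reward stays above a (1 - alpha) fraction of what the baseline
# policy would have earned so far; otherwise fall back to the baseline.
def conservative_choice(ucb_arm, baseline_arm,
                        cum_reward_lcb, baseline_cum_reward, alpha=0.1):
    if cum_reward_lcb >= (1 - alpha) * baseline_cum_reward:
        return ucb_arm        # constraint safely satisfied: explore
    return baseline_arm       # otherwise play the baseline action
```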
no code implementations • 8 Feb 2020 • Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta
While learning in an unknown Markov Decision Process (MDP), an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward.
no code implementations • 30 Jan 2020 • Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
We investigate concentration inequalities for Dirichlet and Multinomial random variables.
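For a sense of what such inequalities look like, here is a quick empirical check of a classical L1 concentration bound for multinomials (Weissman et al.); the specific distribution, sample size, and threshold are chosen arbitrarily for illustration:

```python
import numpy as np

# Weissman et al.: P(||p_hat - p||_1 >= eps) <= (2**m - 2) * exp(-n * eps**2 / 2)
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
m, n, eps, trials = len(p), 200, 0.15, 5000

counts = rng.multinomial(n, p, size=trials)
l1_dev = np.abs(counts / n - p).sum(axis=1)
empirical = (l1_dev >= eps).mean()
bound = (2 ** m - 2) * np.exp(-n * eps ** 2 / 2)
print(f"empirical tail: {empirical:.4f}  vs  bound: {bound:.4f}")
```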
no code implementations • 13 Jan 2020 • Michiel van der Meer, Matteo Pirotta, Elia Bruni
In this work, we present an alternative approach to making an agent compositional through the use of a diagnostic classifier.
no code implementations • ICML 2020 • Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, Alessandro Lazaric
Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost.
1 code implementation • NeurIPS 2019 • Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
The exploration bonus is an effective approach to manage the exploration-exploitation trade-off in Markov Decision Processes (MDPs).
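A hedged sketch of a generic count-based exploration bonus, to illustrate the idea (this is a standard UCB-style bonus, not the paper's exact construction; all quantities below are placeholders):

```python
import numpy as np

# An optimistic Q-update adds a term that shrinks as a state-action pair is
# visited more often, encouraging the agent to try under-explored actions.
def bonus(visit_count: int, scale: float = 1.0) -> float:
    return scale / np.sqrt(max(visit_count, 1))

# Example: optimistic backup for one (s, a) pair with hypothetical quantities.
gamma, reward, next_value = 0.95, 0.3, 1.2
visits = 4
optimistic_q = reward + bonus(visits) + gamma * next_value
```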
no code implementations • NeurIPS 2019 • Ronald Ortner, Matteo Pirotta, Alessandro Lazaric, Ronan Fruit, Odalric-Ambrym Maillard
We consider the problem of online reinforcement learning when several state representations (mapping histories to a discrete state space) are available to the learning agent.
2 code implementations • 1 Nov 2019 • Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, Alessandro Lazaric
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL).
no code implementations • 8 May 2019 • Matteo Papini, Matteo Pirotta, Marcello Restelli
Policy Gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics.
no code implementations • 11 Dec 2018 • Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
We introduce and analyse two algorithms for exploration-exploitation in discrete and continuous Markov Decision Processes (MDPs) based on exploration bonuses.
1 code implementation • NeurIPS 2018 • Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable).
1 code implementation • ICML 2018 • Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, Marcello Restelli
In this paper, we propose a novel reinforcement-learning algorithm consisting of a stochastic variance-reduced version of policy gradient for solving Markov Decision Processes (MDPs).
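For intuition, a schematic SVRG-style correction applied to policy-gradient estimates; this is only a sketch (the actual algorithm also importance-weights the snapshot gradient to account for the change of policy), and the function names are placeholders:

```python
# theta, snapshot, mu_snapshot are parameter vectors (e.g., numpy arrays);
# grad_estimate(params, batch) stands for any unbiased policy-gradient
# estimator such as REINFORCE evaluated on a mini-batch of trajectories.
def svrg_style_update(theta, snapshot, mu_snapshot, batch, grad_estimate, lr=0.01):
    # Small-batch gradient at the current policy, corrected by the small-batch
    # gradient at the snapshot and the large-batch snapshot gradient mu_snapshot.
    g = grad_estimate(theta, batch) - grad_estimate(snapshot, batch) + mu_snapshot
    return theta + lr * g   # gradient ascent on the expected return
```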
no code implementations • ICML 2018 • Andrea Tirinzoni, Andrea Sessa, Matteo Pirotta, Marcello Restelli
In the proposed approach, all the samples are transferred and used by a batch RL algorithm to solve the target task, but their contribution to the learning process is proportional to their importance weight.
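A minimal sketch of importance weighting in this spirit; the densities and normalization below are hypothetical placeholders, not the paper's estimators:

```python
import numpy as np

# Each source-task sample contributes in proportion to how likely it is
# under the target task relative to the source task.
def importance_weights(samples, target_density, source_density):
    w = np.array([target_density(x) / source_density(x) for x in samples])
    return w / w.sum()   # normalized weights used to reweight the batch loss

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100)    # drawn from the source
target = lambda x: np.exp(-0.5 * (x - 0.5) ** 2)      # unnormalized target density
source = lambda x: np.exp(-0.5 * x ** 2)              # unnormalized source density
weights = importance_weights(samples, target, source)
```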
1 code implementation • ICML 2018 • Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner
We introduce SCAL, an algorithm designed to perform efficient exploration-exploitation in any unknown weakly-communicating Markov decision process (MDP) for which an upper bound $c$ on the span of the optimal bias function is known.
no code implementations • 9 Dec 2017 • Matteo Pirotta, Marcello Restelli
In this paper, we propose a novel approach to automatically determine the batch size in stochastic gradient descent methods.
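A hypothetical heuristic in the same spirit, purely for illustration (the rule, tolerance, and doubling schedule below are made up, not the paper's method): grow the batch until the gradient estimate is informative relative to its noise.

```python
import numpy as np

def adaptive_batch_size(sample_grads, rel_tol=0.5, max_batch=10_000):
    grads = np.asarray(sample_grads)        # shape (n, d): per-sample gradients
    mean = grads.mean(axis=0)
    sem = grads.std(axis=0, ddof=1) / np.sqrt(len(grads))   # standard error
    if np.linalg.norm(sem) <= rel_tol * np.linalg.norm(mean):
        return len(grads)                   # current batch is informative enough
    return min(2 * len(grads), max_batch)   # otherwise double the batch
```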
no code implementations • NeurIPS 2017 • Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, Emma Brunskill
The option framework integrates temporal abstraction into the reinforcement learning model through the introduction of macro-actions (i.e., options).
no code implementations • NeurIPS 2017 • Matteo Papini, Matteo Pirotta, Marcello Restelli
Policy gradient methods are among the best Reinforcement Learning (RL) techniques to solve complex control problems.
no code implementations • NeurIPS 2017 • Alberto Maria Metelli, Matteo Pirotta, Marcello Restelli
Within this subspace, using a second-order criterion, we search for the reward function that most penalizes deviations from the expert's policy.
no code implementations • ICML 2017 • Samuele Tosatto, Matteo Pirotta, Carlo D’Eramo, Marcello Restelli
This paper studies B-FQI, an Approximated Value Iteration (AVI) algorithm that exploits a boosting procedure to estimate the action-value function in reinforcement learning problems.
no code implementations • 13 Jun 2014 • Matteo Pirotta, Simone Parisi, Marcello Restelli
The paper is about learning a continuous approximation of the Pareto frontier in Multi-Objective Markov Decision Problems (MOMDPs).
no code implementations • NeurIPS 2013 • Matteo Pirotta, Marcello Restelli, Luca Bascetta
In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement-learning field.