no code implementations • 29 May 2024 • Arpit Agarwal, Nicolas Usunier, Alessandro Lazaric, Maximilian Nickel
In this paper we explore a new approach to recommender systems where we infer user utility based on their return probability to the platform rather than engagement signals.
no code implementations • 19 Mar 2024 • Edoardo Cetin, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric, Yann Ollivier, Ahmed Touati
Offline reinforcement learning algorithms have proven effective on datasets highly connected to the target downstream task.
no code implementations • 16 Mar 2024 • Ayoub Ghriss, Masashi Sugiyama, Alessandro Lazaric
This thesis explores the field of reinforcement learning and builds on existing methods to develop improved ones for tackling the problem of learning in high-dimensional and complex environments.
no code implementations • 7 Feb 2023 • Liyu Chen, Andrea Tirinzoni, Alessandro Lazaric, Matteo Pirotta
We leverage these results to design Layered Autonomous Exploration (LAE), a novel algorithm for AX that attains a sample complexity of $\tilde{\mathcal{O}}(LS^{\rightarrow}_{L(1+\epsilon)}\Gamma_{L(1+\epsilon)} A \ln^{12}(S^{\rightarrow}_{L(1+\epsilon)})/\epsilon^2)$, where $S^{\rightarrow}_{L(1+\epsilon)}$ is the number of states that are incrementally $L(1+\epsilon)$-controllable, $A$ is the number of actions, and $\Gamma_{L(1+\epsilon)}$ is the branching factor of the transitions over such states.
1 code implementation • 5 Jan 2023 • Lina Mezghani, Sainbayar Sukhbaatar, Piotr Bojanowski, Alessandro Lazaric, Karteek Alahari
Developing agents that can execute multiple skills by learning from pre-collected datasets is an important problem in robotics, where online interaction with the environment is extremely time-consuming.
no code implementations • 19 Dec 2022 • Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
In contextual linear bandits, the reward function is assumed to be a linear combination of an unknown reward vector and a given embedding of context-arm pairs.
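As a rough illustration of this model (dimensions, variable names, and noise level are illustrative assumptions, not taken from the paper), the expected reward of each arm is an inner product between a fixed unknown vector and the embedding of the context-arm pair:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms = 5, 10

theta_star = rng.normal(size=d)          # unknown reward vector
phi = rng.normal(size=(n_arms, d))       # embeddings of (context, arm) pairs for one context

expected_rewards = phi @ theta_star      # linear reward model
observed = expected_rewards + rng.normal(scale=0.1, size=n_arms)  # noisy feedback
best_arm = int(np.argmax(expected_rewards))
```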
no code implementations • 4 Nov 2022 • Yifang Chen, Karthik Sankararaman, Alessandro Lazaric, Matteo Pirotta, Dmytro Karamshuk, Qifan Wang, Karishma Mandyam, Sinong Wang, Han Fang
We design a novel algorithmic template, Weak Labeler Active Cover (WL-AC), that is able to robustly leverage the lower quality weak labelers to reduce the query complexity while retaining the desired level of accuracy.
no code implementations • 24 Oct 2022 • Andrea Tirinzoni, Matteo Papini, Ahmed Touati, Alessandro Lazaric, Matteo Pirotta
We study the problem of representation learning in stochastic contextual linear bandits.
no code implementations • 18 Oct 2022 • Virginie Do, Elvis Dohmatob, Matteo Pirotta, Alessandro Lazaric, Nicolas Usunier
We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context.
no code implementations • 10 Oct 2022 • Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
We also initiate the study of learning $\epsilon$-optimal policies without access to a generative model (i.e., the so-called best-policy identification problem), and show that sample-efficient learning is impossible in general.
no code implementations • 4 Oct 2022 • Rui Yuan, Simon S. Du, Robert M. Gower, Alessandro Lazaric, Lin Xiao
We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class.
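For reference, a log-linear policy is a softmax over linear features of state-action pairs; the following minimal sketch (feature map and dimensions are assumed, not from the paper) shows the policy class being analyzed:

```python
import numpy as np

def log_linear_policy(theta, features):
    """features: (n_actions, d) feature matrix phi(s, a); returns pi_theta(.|s)."""
    logits = features @ theta
    logits -= logits.max()               # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(1)
d, n_actions = 4, 3
theta = rng.normal(size=d)
phi_s = rng.normal(size=(n_actions, d))
print(log_linear_policy(theta, phi_s))   # a valid distribution over actions
```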
no code implementations • 21 Mar 2022 • Akram Erraqabi, Marlos C. Machado, Mingde Zhao, Sainbayar Sukhbaatar, Alessandro Lazaric, Ludovic Denoyer, Yoshua Bengio
In reinforcement learning, the graph Laplacian has proved to be a valuable tool in the task-agnostic setting, with applications ranging from skill discovery to reward shaping.
1 code implementation • 31 Jan 2022 • Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, Lerrel Pinto
In this work, we propose Exploratory data for Offline RL (ExORL), a data-centric approach to offline RL.
no code implementations • 30 Jan 2022 • Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco
Computing a Gaussian process (GP) posterior has a computational cost that is cubic in the number of historical points.
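A minimal sketch of where the cubic cost comes from (the RBF kernel and hyperparameters are illustrative assumptions): the Cholesky factorization of the $n \times n$ kernel matrix dominates the computation:

```python
import numpy as np

def gp_posterior_mean(X, y, X_test, lengthscale=1.0, noise=1e-2):
    """Exact GP regression with an RBF kernel; the Cholesky factorization
    of the n x n kernel matrix is the O(n^3) bottleneck."""
    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq / lengthscale**2)

    K = rbf(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)                               # O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return rbf(X_test, X) @ alpha

X = np.linspace(0, 1, 50)[:, None]
y = np.sin(6 * X[:, 0])
print(gp_posterior_mean(X, y, np.array([[0.5]])))
```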
no code implementations • 13 Dec 2021 • Evrard Garcelon, Vashist Avadhanula, Alessandro Lazaric, Matteo Pirotta
We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased \emph{evaluations} of the true reward of each arm, and selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds.
no code implementations • 2 Dec 2021 • Paul Luyo, Evrard Garcelon, Alessandro Lazaric, Matteo Pirotta
We first consider the setting of linear-mixture MDPs (Ayoub et al., 2020), a.k.a. the model-based setting, and provide a unified framework for analyzing joint and local differentially private (DP) exploration.
no code implementations • 23 Nov 2021 • Jean Tarbouriech, Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Michal Valko, Alessandro Lazaric
We introduce a generic strategy for provably efficient multi-goal exploration.
no code implementations • NeurIPS 2021 • Matteo Papini, Andrea Tirinzoni, Aldo Pacchiano, Marcello Restelli, Alessandro Lazaric, Matteo Pirotta
We study the role of the representation of state-action value functions in regret minimization in finite-horizon Markov Decision Processes (MDPs) with linear structure.
1 code implementation • ICML Workshop URL 2021 • Pierre-Alexandre Kamienny, Jean Tarbouriech, Sylvain Lamprier, Alessandro Lazaric, Ludovic Denoyer
Learning meaningful behaviors in the absence of reward is a difficult problem in reinforcement learning.
no code implementations • 23 Jul 2021 • Rui Yuan, Robert M. Gower, Alessandro Lazaric
We then instantiate our theorems in different settings, where we both recover existing results and obtain improved sample complexity, e.g., $\widetilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity for the convergence to the global optimum for Fisher-non-degenerated parametrized policies.
8 code implementations • ICLR 2022 • Denis Yarats, Rob Fergus, Alessandro Lazaric, Lerrel Pinto
We present DrQ-v2, a model-free reinforcement learning (RL) algorithm for visual continuous control.
no code implementations • 24 Jun 2021 • Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
We derive a novel asymptotic problem-dependent lower-bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs).
no code implementations • ICLR 2022 • Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, LiWei Wang, Simon S. Du
We also obtain a new upper bound for conservative low-rank MDPs.
no code implementations • ICML Workshop URL 2021 • Akram Erraqabi, Mingde Zhao, Marlos C. Machado, Yoshua Bengio, Sainbayar Sukhbaatar, Ludovic Denoyer, Alessandro Lazaric
In this work, we introduce a method that explicitly couples representation learning with exploration when the agent is not provided with a uniform prior over the state space.
no code implementations • NeurIPS 2021 • Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
no code implementations • 8 Apr 2021 • Matteo Papini, Andrea Tirinzoni, Marcello Restelli, Alessandro Lazaric, Matteo Pirotta
We show that the regret is indeed never worse than the regret obtained by running LinUCB on the best representation (up to a $\ln M$ factor).
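For context, a minimal sketch of a single LinUCB decision for one fixed representation (regularization, exploration coefficient, and variable names are assumptions, not the paper's implementation):

```python
import numpy as np

def linucb_choose(A_inv, b, features, alpha=1.0):
    """One LinUCB step: pick the arm maximizing the optimistic reward estimate."""
    theta_hat = A_inv @ b
    ucb = features @ theta_hat + alpha * np.sqrt(
        np.einsum("ad,dk,ak->a", features, A_inv, features))
    return int(np.argmax(ucb))

d, n_arms = 5, 8
rng = np.random.default_rng(2)
A_inv = np.eye(d)                  # inverse of the ridge-regularized design matrix
b = np.zeros(d)
features = rng.normal(size=(n_arms, d))
print(linucb_choose(A_inv, b, features))
```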
1 code implementation • 22 Feb 2021 • Denis Yarats, Rob Fergus, Alessandro Lazaric, Lerrel Pinto
Unfortunately, in RL, representation learning is confounded with the exploratory experience of the agent -- learning a useful representation requires diverse data, while effective exploration is only possible with coherent representations.
no code implementations • 1 Jan 2021 • Pierre-Alexandre Kamienny, Matteo Pirotta, Alessandro Lazaric, Thibault Lavril, Nicolas Usunier, Ludovic Denoyer
Meta-reinforcement learning aims at finding a policy able to generalize to new environments.
no code implementations • NeurIPS 2020 • Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
We investigate the exploration of an unknown environment when no reward function is provided.
no code implementations • NeurIPS 2020 • Andrea Tirinzoni, Matteo Pirotta, Marcello Restelli, Alessandro Lazaric
Finally, we remove forced exploration and build on confidence intervals of the optimization problem to encourage a minimum level of exploration that is better adapted to the problem structure.
no code implementations • NeurIPS 2020 • Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, Emma Brunskill
There has been growing progress on theoretical analyses for provably efficient learning in MDPs with linear function approximation, but much of the existing work has made strong assumptions to enable exploration by conventional exploration frameworks.
no code implementations • NeurIPS 2021 • Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior.
no code implementations • ICML 2020 • Marc Abeille, Alessandro Lazaric
We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting.
no code implementations • 10 Jul 2020 • Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
We consider the problem of exploration-exploitation in communicating Markov Decision Processes.
no code implementations • 23 May 2020 • Andrea Tirinzoni, Alessandro Lazaric, Marcello Restelli
We study finite-armed stochastic bandits where the rewards of each arm might be correlated to those of other arms.
no code implementations • ICML 2020 • Leonardo Cella, Alessandro Lazaric, Massimiliano Pontil
The goal is to select a learning algorithm that works well on average over a class of bandit tasks sampled from a task distribution.
1 code implementation • 6 May 2020 • Pierre-Alexandre Kamienny, Matteo Pirotta, Alessandro Lazaric, Thibault Lavril, Nicolas Usunier, Ludovic Denoyer
We test the performance of our algorithm in a variety of environments where tasks may vary within each episode.
no code implementations • 6 Mar 2020 • Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, Alessandro Lazaric
Using a number of simple domains with heterogeneous noise in their transitions, we show that our heuristic-based algorithm outperforms both our original algorithm and the maximum entropy algorithm in the small sample regime, while achieving similar asymptotic performance as that of the original algorithm.
no code implementations • ICML 2020 • Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill
This has two important consequences: 1) it shows that exploration is possible using only \emph{batch assumptions} with an algorithm that achieves the optimal statistical rate for the setting we consider, which is more general than prior work on low-rank MDPs; 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting.
1 code implementation • ICML 2020 • Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco
Gaussian processes (GP) are one of the most successful frameworks to model uncertainty.
no code implementations • NeurIPS 2020 • Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, Matteo Pirotta
In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior.
no code implementations • 8 Feb 2020 • Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta
While learning in an unknown Markov Decision Process (MDP), an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward.
no code implementations • 8 Feb 2020 • Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta
In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself.
no code implementations • 30 Jan 2020 • Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
We investigate concentration inequalities for Dirichlet and Multinomial random variables.
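For reference, one classical inequality of this type, the $L_1$ deviation bound for the empirical distribution $\hat{p}_n$ of $n$ multinomial samples over $S$ outcomes, reads (this is a standard bound, not necessarily the one derived in the paper):

$$\mathbb{P}\left(\|\hat{p}_n - p\|_1 \ge \varepsilon\right) \le (2^S - 2)\, e^{-n\varepsilon^2/2}.$$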
no code implementations • ICML 2020 • Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, Alessandro Lazaric
Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost.
1 code implementation • NeurIPS 2019 • Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
The exploration bonus is an effective approach to manage the exploration-exploitation trade-off in Markov Decision Processes (MDPs).
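As a generic illustration (the constants and exact form are assumptions, not the bonus analyzed in the paper), a count-based exploration bonus inflates the value of rarely visited state-action pairs and vanishes as visit counts grow:

```python
import numpy as np

def exploration_bonus(counts, t, scale=1.0):
    """A generic count-based bonus of the UCB flavor: large for rarely
    visited state-action pairs, vanishing as counts grow."""
    return scale * np.sqrt(np.log(max(t, 2)) / np.maximum(counts, 1))

counts = np.array([1, 10, 100, 1000])
print(exploration_bonus(counts, t=1000))
```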
no code implementations • NeurIPS 2019 • Ronald Ortner, Matteo Pirotta, Alessandro Lazaric, Ronan Fruit, Odalric-Ambrym Maillard
We consider the problem of online reinforcement learning when several state representations (mapping histories to a discrete state space) are available to the learning agent.
no code implementations • NeurIPS 2019 • Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, Emma Brunskill
We prove that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and our method achieves a polynomial sample complexity bound in the horizon and the number of anchor points.
2 code implementations • 1 Nov 2019 • Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, Alessandro Lazaric
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL).
1 code implementation • NeurIPS 2019 • Nicolas Carion, Gabriel Synnaeve, Alessandro Lazaric, Nicolas Usunier
While centralized reinforcement learning methods can optimally solve small MAC instances, they do not scale to large problems and they fail to generalize to scenarios different from those seen during training.
1 code implementation • ACL 2019 • Rahma Chaabouni, Eugene Kharitonov, Alessandro Lazaric, Emmanuel Dupoux, Marco Baroni
We train models to communicate about paths in a simple gridworld, using miniature languages that reflect or violate various natural language trends, such as the tendency to avoid redundancy or to minimize long-distance dependencies.
1 code implementation • 13 Mar 2019 • Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco
Moreover, we show that our procedure selects at most $\tilde{O}(d_{eff})$ points, where $d_{eff}$ is the effective dimension of the explored space, which is typically much smaller than both $d$ and $t$.
no code implementations • 28 Feb 2019 • Jean Tarbouriech, Alessandro Lazaric
As the noise level is initially unknown, we need to trade off the exploration of the environment to estimate the noise and the exploitation of these estimates to compute a policy maximizing the accuracy of the mean predictions.
no code implementations • 11 Dec 2018 • Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
We introduce and analyse two algorithms for exploration-exploitation in discrete and continuous Markov Decision Processes (MDPs) based on exploration bonuses.
no code implementations • NeurIPS 2018 • Romain Warlop, Alessandro Lazaric, Jérémie Mary
A common assumption in recommender systems (RS) is the existence of a best fixed recommendation strategy.
no code implementations • 27 Nov 2018 • Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, Michal Valko
In stochastic multi-armed bandits, the reward distribution of each arm is assumed to be stationary.
1 code implementation • NeurIPS 2018 • Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable).
no code implementations • ICML 2018 • Marc Abeille, Alessandro Lazaric
Thompson sampling (TS) is an effective approach to trade off exploration and exploitation in reinforcement learning.
no code implementations • ICML 2018 • Daniele Calandriello, Alessandro Lazaric, Ioannis Koutis, Michal Valko
By constructing a spectrally-similar graph, we are able to bound the error induced by the sparsification for a variety of downstream tasks (e.g., SSL).
no code implementations • 27 Mar 2018 • Daniele Calandriello, Alessandro Lazaric, Michal Valko
In this paper, we introduce SQUEAK, a new algorithm for kernel approximation based on RLS sampling that sequentially processes the dataset, storing a dictionary which creates accurate kernel matrix approximations with a number of points that only depends on the effective dimension $d_{eff}(\gamma)$ of the dataset.
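For context, a minimal sketch of ridge leverage scores and the effective dimension they sum to (the exact normalization used by SQUEAK may differ; the kernel and data here are illustrative):

```python
import numpy as np

def ridge_leverage_scores(K, gamma):
    """Exact ridge leverage scores tau_i(gamma) = [K (K + gamma I)^{-1}]_{ii};
    their sum is the effective dimension d_eff(gamma)."""
    n = K.shape[0]
    scores = np.diag(K @ np.linalg.inv(K + gamma * np.eye(n)))
    return scores, scores.sum()

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # RBF kernel matrix
tau, d_eff = ridge_leverage_scores(K, gamma=1.0)
print(d_eff)    # typically much smaller than the number of points
```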
1 code implementation • ICML 2018 • Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner
We introduce SCAL, an algorithm designed to perform efficient exploration-exploitation in any unknown weakly-communicating Markov decision process (MDP) for which an upper bound $c$ on the span of the optimal bias function is known.
no code implementations • NeurIPS 2017 • Daniele Calandriello, Alessandro Lazaric, Michal Valko
The embedded space is continuously updated to guarantee that the embedding remains accurate, and we show that the per-step cost only grows with the effective dimension of the problem and not with $T$.
no code implementations • NeurIPS 2017 • Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, Emma Brunskill
The option framework integrates temporal abstraction into the reinforcement learning model through the introduction of macro-actions (i.e., options).
no code implementations • ICML 2017 • Daniele Calandriello, Alessandro Lazaric, Michal Valko
First-order KOCO methods such as functional gradient descent require only $\mathcal{O}(t)$ time and space per iteration, and, when the only information on the losses is their convexity, achieve a minimax optimal $\mathcal{O}(\sqrt{T})$ regret.
no code implementations • 7 May 2017 • Kamyar Azizzadenesheli, Alessandro Lazaric, Animashree Anandkumar
We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDP) based on spectral decomposition methods.
no code implementations • 27 Mar 2017 • Marc Abeille, Alessandro Lazaric
Despite the empirical and theoretical success in a wide range of problems from multi-armed bandit to linear bandit, we show that when studying the frequentist regret of TS in control problems, we need to trade off the frequency of sampling optimistic parameters and the frequency of switches in the control policy.
no code implementations • 25 Mar 2017 • Ronan Fruit, Alessandro Lazaric
While a large body of empirical results shows that temporally-extended actions and options may significantly affect the learning performance of an agent, the theoretical understanding of how and when options can be beneficial in online reinforcement learning remains relatively limited.
no code implementations • ICML 2017 • Carlos Riquelme, Mohammad Ghavamzadeh, Alessandro Lazaric
We explore the sequential decision making problem where the goal is to estimate uniformly well a number of linear models, given a shared budget of random contexts independently sampled from a known distribution.
no code implementations • 20 Nov 2016 • Marc Abeille, Alessandro Lazaric
We derive an alternative proof for the regret of Thompson sampling (TS) in the stochastic linear bandit setting.
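A minimal sketch of one Thompson-sampling step in a linear bandit (the prior scale, variable names, and Gaussian posterior form are generic assumptions, not details from the paper):

```python
import numpy as np

def ts_linear_step(V_inv, b, features, v=1.0, rng=np.random.default_rng(4)):
    """One Thompson-sampling step for a linear bandit: sample a parameter
    from a Gaussian centered at the ridge estimate, then act greedily on it."""
    theta_hat = V_inv @ b
    theta_tilde = rng.multivariate_normal(theta_hat, v**2 * V_inv)
    return int(np.argmax(features @ theta_tilde))

d, n_arms = 5, 8
rng = np.random.default_rng(5)
V_inv = np.eye(d)                 # inverse of the regularized design matrix
b = np.zeros(d)
features = rng.normal(size=(n_arms, d))
print(ts_linear_step(V_inv, b, features))
```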
no code implementations • 11 Nov 2016 • Kamyar Azizzadenesheli, Alessandro Lazaric, Animashree Anandkumar
We derive finite-time regret bounds for our algorithm with a weak dependence on the dimensionality of the observed space.
no code implementations • 13 Sep 2016 • Daniele Calandriello, Alessandro Lazaric, Michal Valko
We derive a new proof to show that the incremental resparsification algorithm proposed by Kelner and Levin (2013) produces a spectral sparsifier in high probability.
no code implementations • 17 Aug 2016 • Kamyar Azizzadenesheli, Alessandro Lazaric, Animashree Anandkumar
Generally in RL, one can assume a generative model for the environment, e.g., a graphical model, and then the task for the RL agent is to learn the model parameters and find the optimal strategy based on these learned parameters.
no code implementations • 25 Feb 2016 • Kamyar Azizzadenesheli, Alessandro Lazaric, Animashree Anandkumar
We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDP) based on spectral decomposition methods.
no code implementations • 21 Jan 2016 • Daniele Calandriello, Alessandro Lazaric, Michal Valko, Ioannis Koutis
While the harmonic function solution performs well in many semi-supervised learning (SSL) tasks, it is known to scale poorly with the number of samples.
no code implementations • 16 Jul 2015 • Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, Peter Auer, András Antos
If the variance of the distributions were known, one could design an optimal sampling strategy by collecting a number of independent samples per distribution that is proportional to their variance.
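As a small worked example of this allocation rule (the numbers are illustrative):

```python
import numpy as np

def oracle_allocation(variances, budget):
    """With known variances, allocate samples to each distribution
    proportionally to its variance (rounding aside)."""
    variances = np.asarray(variances, dtype=float)
    return np.round(budget * variances / variances.sum()).astype(int)

print(oracle_allocation([1.0, 4.0, 0.5], budget=110))   # -> [20 80 10]
```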
no code implementations • NeurIPS 2014 • Amir Sani, Gergely Neu, Alessandro Lazaric
We consider the problem of online optimization, where a learner chooses a decision from a given decision set and suffers some loss associated with the decision and the state of the environment.
no code implementations • NeurIPS 2014 • Daniele Calandriello, Alessandro Lazaric, Marcello Restelli
This is equivalent to assuming that the weight vectors of the task value functions are \textit{jointly sparse}, i.e., the set of their non-zero components is small and shared across tasks.
no code implementations • NeurIPS 2014 • Marta Soare, Alessandro Lazaric, Rémi Munos
We study the best-arm identification problem in linear bandit, where the rewards of the arms depend linearly on an unknown parameter $\theta^*$ and the objective is to return the arm with the largest reward.
no code implementations • 4 Feb 2014 • Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill
In this paper we consider the problem of online stochastic optimization of a locally smooth function under bandit feedback.
no code implementations • NeurIPS 2013 • Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill
Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents.
no code implementations • 5 May 2013 • Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill
In some reinforcement learning problems an agent may be provided with a set of input policies, perhaps learned from prior experience or provided by advisors.
no code implementations • NeurIPS 2012 • Amir Sani, Alessandro Lazaric, Rémi Munos
In stochastic multi-armed bandits, the objective is to solve the exploration-exploitation dilemma and ultimately maximize the expected reward.
no code implementations • NeurIPS 2012 • Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric
We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting.
no code implementations • NeurIPS 2011 • Alessandro Lazaric, Marcello Restelli
Transfer reinforcement learning (RL) methods leverage the experience collected on a set of source tasks to speed up RL algorithms.
no code implementations • NeurIPS 2011 • Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, Sébastien Bubeck
We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap).
no code implementations • NeurIPS 2010 • Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm.
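For context, a minimal sketch of plain LSTD applied to randomly projected features (the projection scaling and regularization are generic assumptions, not the paper's construction):

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma=0.99, reg=1e-6):
    """Plain LSTD on a batch of transitions: solve A w = b with
    A = Phi^T (Phi - gamma * Phi'), b = Phi^T r."""
    A = phi.T @ (phi - gamma * phi_next) + reg * np.eye(phi.shape[1])
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

rng = np.random.default_rng(6)
n, D, d = 200, 50, 10
proj = rng.normal(size=(D, d)) / np.sqrt(d)     # random projection of the features
raw, raw_next = rng.normal(size=(n, D)), rng.normal(size=(n, D))
rewards = rng.normal(size=n)
w = lstd(raw @ proj, raw_next @ proj, rewards)
print(w.shape)                                  # (10,)
```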