Search Results for author: Bruno Scherrer

Found 22 papers, 0 papers with code

Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

no code implementations • 17 Mar 2021 • Lin Chen, Bruno Scherrer, Peter L. Bartlett

In this regime, for any $q\in[\gamma^{2}, 1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$ and it requires $\Omega\left(\frac{d}{\gamma^{2}\left(q-\gamma^{2}\right)\varepsilon^{2}}\exp\left(\Theta\left(d\gamma^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$.

Off-policy evaluation

Leverage the Average: an Analysis of KL Regularization in RL

no code implementations • 31 Mar 2020 • Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist

Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance.

A Theory of Regularized Markov Decision Processes

no code implementations • 31 Jan 2019 • Matthieu Geist, Bruno Scherrer, Olivier Pietquin

Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence.

Q-Learning
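
The regularized operators behind these methods can be illustrated on a tabular MDP. Below is a minimal sketch, not the paper's algorithm, of entropy-regularized value iteration: the hard max in the Bellman operator becomes a temperature-scaled log-sum-exp, and the greedy policy becomes a softmax. The encoding of the MDP as a transition tensor `P` of shape `(A, S, S)` and a reward matrix `R` of shape `(S, A)` is an assumption made for illustration.

```python
import numpy as np

def soft_value_iteration(P, R, gamma=0.9, tau=0.1, iters=2000):
    """Entropy-regularized (soft) value iteration on a finite MDP.

    P : (A, S, S) transition matrices, R : (S, A) rewards.
    tau is the regularization temperature; tau -> 0 recovers the
    unregularized Bellman optimality operator.
    """
    A, S, _ = P.shape
    v = np.zeros(S)
    for _ in range(iters):
        # Q-values under the current value estimate: (S, A).
        q = R + gamma * np.einsum("asx,x->sa", P, v)
        # Soft maximum: tau * log sum_a exp(q / tau), stabilized.
        q_max = q.max(axis=1)
        v = q_max + tau * np.log(np.exp((q - q_max[:, None]) / tau).sum(axis=1))
    # The associated softmax (Boltzmann) policy.
    pi = np.exp((q - q.max(axis=1, keepdims=True)) / tau)
    pi /= pi.sum(axis=1, keepdims=True)
    return pi, v
```

As tau shrinks, the soft values approach the unregularized optimal values from above (the per-step entropy bonus is at most tau·log A, so the gap is bounded by tau·log A / (1 - gamma)).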

Multiple-Step Greedy Policies in Approximate and Online Reinforcement Learning

no code implementations • NeurIPS 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control.

Reinforcement Learning

Anderson Acceleration for Reinforcement Learning

no code implementations • 25 Sep 2018 • Matthieu Geist, Bruno Scherrer

Anderson acceleration is an old and simple method for accelerating the computation of a fixed point.

Reinforcement Learning
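
To make the fixed-point idea concrete, here is a minimal sketch of Anderson acceleration for a generic fixed-point problem x = g(x). It is a generic illustration under the standard least-squares formulation, not the paper's specific application to RL; the function name and window size `m` are illustrative choices.

```python
import numpy as np

def anderson_fixed_point(g, x0, m=5, iters=50, tol=1e-10):
    """Anderson acceleration for the fixed-point problem x = g(x).

    Keeps a window of the last m iterates and mixes them with
    least-squares weights that minimize the norm of the combined
    residual f_i = g(x_i) - x_i (weights summing to one).
    """
    xs = [np.asarray(x0, dtype=float)]
    gs = [g(xs[0])]
    for _ in range(iters):
        mk = min(m, len(xs))
        # Residual columns for the last mk iterates.
        F = np.column_stack([gs[-mk + i] - xs[-mk + i] for i in range(mk)])
        if np.linalg.norm(F[:, -1]) < tol:
            break
        if mk == 1:
            x_new = gs[-1]  # plain Picard step on the first iteration
        else:
            # Solve min ||f_last - dF @ gamma|| (unconstrained form of
            # the sum-to-one constrained least-squares problem).
            dF = F[:, 1:] - F[:, :-1]
            gamma, *_ = np.linalg.lstsq(dF, F[:, -1], rcond=None)
            a = np.zeros(mk)
            a[-1] = 1.0
            a[1:] -= gamma
            a[:-1] += gamma
            x_new = sum(a[i] * gs[-mk + i] for i in range(mk))
        xs.append(x_new)
        gs.append(g(x_new))
    return xs[-1]
```

On a contraction such as g(x) = cos(x), this typically reaches the fixed point in far fewer iterations than plain Picard iteration.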

How to Combine Tree-Search Methods in Reinforcement Learning

no code implementations • 6 Sep 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success.

Reinforcement Learning
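
The simplest finite-horizon lookahead is the h-step greedy policy: apply the optimal Bellman operator h-1 times to a value estimate, then act greedily. A minimal tabular sketch (an illustration of the general idea, not the paper's tree-search combination; the `(A, S, S)` / `(S, A)` MDP encoding is assumed):

```python
import numpy as np

def h_step_greedy(P, R, v, gamma=0.9, h=3):
    """h-step greedy policy with respect to a value estimate v.

    Applies the Bellman optimality operator h-1 times to v, then acts
    greedily; h = 1 is the usual one-step greedy policy.
    P : (A, S, S) transitions, R : (S, A) rewards.
    """
    w = np.asarray(v, dtype=float)
    for _ in range(h - 1):
        w = (R + gamma * np.einsum("asx,x->sa", P, w)).max(axis=1)
    q = R + gamma * np.einsum("asx,x->sa", P, w)
    return q.argmax(axis=1)
```

With a longer lookahead the policy can be optimal even when v is a crude estimate (e.g. v = 0), at the cost of more computation per decision.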

Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning

no code implementations • 21 May 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control.

Reinforcement Learning

Rate of Convergence and Error Bounds for LSTD($\lambda$)

no code implementations • 13 May 2014 • Manel Tagorti, Bruno Scherrer

We consider LSTD($\lambda$), the least-squares temporal-difference algorithm with eligibility traces proposed by Boyan (2002).
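
A minimal sketch of the algorithm being analyzed, assuming linear features and a single trajectory (the ridge term `reg` is an implementation convenience added here for invertibility, not part of the analysis):

```python
import numpy as np

def lstd_lambda(phis, rewards, gamma=0.9, lam=0.5, reg=1e-6):
    """LSTD(lambda) weights from one trajectory.

    phis    : (T+1, d) feature vectors phi(s_0), ..., phi(s_T)
    rewards : (T,) rewards r_1, ..., r_T
    Accumulates the eligibility-trace statistics A and b, then
    solves A w = b for the linear value-function weights.
    """
    T, d = len(rewards), phis.shape[1]
    A = reg * np.eye(d)        # small ridge term for invertibility
    b = np.zeros(d)
    z = np.zeros(d)            # eligibility trace
    for t in range(T):
        z = gamma * lam * z + phis[t]
        A += np.outer(z, phis[t] - gamma * phis[t + 1])
        b += rewards[t] * z
    return np.linalg.solve(A, b)
```

When the value function of the evaluated policy is exactly linear in the features, the solution recovers it regardless of lambda; lambda trades off the bias and variance of the estimate otherwise.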

Approximate Policy Iteration Schemes: A Comparison

no code implementations • 12 May 2014 • Bruno Scherrer

PSDP$_\infty$ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but it requires a number of iterations similar to that of API.

Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

no code implementations • NeurIPS 2013 • Victor Gabillon, Mohammad Ghavamzadeh, Bruno Scherrer

A close look at the literature of this game shows that while ADP algorithms, which have been (almost) entirely based on approximating the value function, have performed poorly in Tetris, methods that search directly in the space of policies by learning the policy parameters with a black-box optimizer, such as the cross-entropy (CE) method, have achieved the best reported results.

Policy Search: Any Local Optimum Enjoys a Global Performance Guarantee

no code implementations • 6 Jun 2013 • Bruno Scherrer, Matthieu Geist

Local Policy Search is a popular reinforcement learning approach for handling large state spaces.

On the Performance Bounds of some Policy Search Dynamic Programming Algorithms

no code implementations • 3 Jun 2013 • Bruno Scherrer

We then describe an algorithm, Non-Stationary Direct Policy Iteration (NSDPI), that can either be seen as 1) a variation of Policy Search by Dynamic Programming by Bagnell et al. (2003) to the infinite horizon situation or 2) a simplified version of the Non-Stationary PI with growing period of Scherrer and Lesner (2012).

Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

no code implementations • NeurIPS 2013 • Bruno Scherrer

We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage.
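
The two update rules differ only in how many states are switched per iteration. A minimal tabular sketch of both variants (an illustration of the definitions above, not the paper's complexity analysis; the `(A, S, S)` / `(S, A)` MDP encoding is assumed):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, variant="howard"):
    """Exact policy iteration on a finite MDP.

    P : (A, S, S) transition matrices, R : (S, A) rewards.
    'howard'  switches the action in every state with positive advantage;
    'simplex' switches only in the state with the largest advantage.
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma P_pi) v = r_pi exactly.
        P_pi = P[pi, np.arange(S), :]
        r_pi = R[np.arange(S), pi]
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Advantage of the best action in each state.
        q = R + gamma * np.einsum("asx,x->sa", P, v)
        adv = q.max(axis=1) - v
        if np.all(adv <= 1e-10):
            return pi, v
        if variant == "howard":
            improve = adv > 1e-10
            pi = np.where(improve, q.argmax(axis=1), pi)
        else:  # simplex: update only the state with maximal advantage
            pi = pi.copy()
            s = int(adv.argmax())
            pi[s] = int(q[s].argmax())
```

Both variants reach the same optimal policy; the point of the paper is how many such iterations each can need in the worst case.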

Tight Performance Bounds for Approximate Modified Policy Iteration with Non-Stationary Policies

no code implementations • 20 Apr 2013 • Boris Lesner, Bruno Scherrer

For this algorithm we provide an error propagation analysis in the form of a performance bound of the resulting policies that can improve the usual performance bound by a factor $O(1-\gamma)$, which is significant when the discount factor $\gamma$ is close to 1.

Off-policy Learning with Eligibility Traces: A Survey

no code implementations • 15 Apr 2013 • Matthieu Geist, Bruno Scherrer

In the framework of Markov Decision Processes, we study off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from a single trajectory possibly generated by some other policy.

On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

no code implementations • NeurIPS 2012 • Bruno Scherrer, Boris Lesner

We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy.

Approximate Modified Policy Iteration

no code implementations • 14 May 2012 • Bruno Scherrer, Victor Gabillon, Mohammad Ghavamzadeh, Matthieu Geist

Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods.

General Classification
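
MPI interpolates between the two methods by performing m steps of partial policy evaluation after each greedy step: m = 1 gives value iteration and m → ∞ gives policy iteration. A minimal tabular sketch (an illustration of exact MPI, not the approximate version studied in the paper; the `(A, S, S)` / `(S, A)` MDP encoding is assumed):

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, m=5, iters=100):
    """Exact modified policy iteration on a finite MDP.

    P : (A, S, S) transition matrices, R : (S, A) rewards.
    m = 1 recovers value iteration; large m approaches policy iteration.
    """
    A, S, _ = P.shape
    v = np.zeros(S)
    for _ in range(iters):
        # Greedy step: pi <- argmax_a [ R + gamma P v ].
        q = R + gamma * np.einsum("asx,x->sa", P, v)
        pi = q.argmax(axis=1)
        # Partial evaluation: apply the Bellman operator T_pi m times.
        P_pi = P[pi, np.arange(S), :]
        r_pi = R[np.arange(S), pi]
        for _ in range(m):
            v = r_pi + gamma * P_pi @ v
    return pi, v
```

The choice of m trades the cost of each evaluation sweep against the number of greedy steps needed to converge.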

Biasing Approximate Dynamic Programming with a Lower Discount Factor

no code implementations • NeurIPS 2008 • Marek Petrik, Bruno Scherrer

We thus propose another justification: when the rewards are received only sporadically (as it is the case in Tetris), we can derive tighter bounds, which support a significant performance increase with a decrease in the discount factor.
