Search Results for author: Bilal Piot

Found 39 papers, 16 papers with code

Multi-turn Reinforcement Learning from Preference Human Feedback

no code implementations23 May 2024 Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos

Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks.

reinforcement-learning Reinforcement Learning (RL)

Human Alignment of Large Language Models through Online Preference Optimisation

no code implementations13 Mar 2024 Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot

Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm.

Generalized Preference Optimization: A Unified Approach to Offline Alignment

no code implementations8 Feb 2024 Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, Bilal Piot

The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss.

Direct Language Model Alignment from Online AI Feedback

no code implementations7 Feb 2024 Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, Mathieu Blondel

Moreover, responses in these datasets are often sampled from a language model distinct from the one being aligned, and since the model evolves over training, the alignment phase is inevitably off-policy.

Language Modelling

A General Theoretical Paradigm to Understand Learning from Human Preferences

1 code implementation18 Oct 2023 Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos

In particular we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations.

Unlocking the Power of Representations in Long-term Novelty-based Exploration

no code implementations2 May 2023 Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, Bilal Piot

We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method for novelty-based exploration that estimates visitation counts for clusters of states based on their similarity in a chosen embedding space.

Atari Games Clustering +1

The Edge of Orthogonality: A Simple View of What Makes BYOL Tick

no code implementations9 Feb 2023 Pierre H. Richemond, Allison Tam, Yunhao Tang, Florian Strub, Bilal Piot, Felix Hill

With simple linear algebra, we show that when using a linear predictor, the optimal predictor is close to an orthogonal projection, and propose a general framework based on orthonormalization that enables to interpret and give intuition on why BYOL works.

Never Give Up: Learning Directed Exploration Strategies

6 code implementations ICLR 2020 Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andew Bolt, Charles Blundell

Our method doubles the performance of the base agent in all hard exploration in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human normalised score of 1344. 0%.

Atari Games

Towards Consistent Performance on Atari using Expert Demonstrations

no code implementations ICLR 2019 Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, Matteo Hessel, Rémi Munos, Olivier Pietquin

Despite significant advances in the field of deep Reinforcement Learning (RL), today's algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games.

Atari Games Reinforcement Learning (RL)

World Discovery Models

1 code implementation20 Feb 2019 Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo Avila Pires, Jean-bastien Grill, Florent Altché, Rémi Munos

As humans we are driven by a strong desire for seeking novelty in our world.

Neural Predictive Belief Representations

no code implementations15 Nov 2018 Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo A. Pires, Rémi Munos

In partially observable domains it is important for the representation to encode a belief state, a sufficient statistic of the observations seen so far.

Decision Making Representation Learning

Observe and Look Further: Achieving Consistent Performance on Atari

no code implementations29 May 2018 Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, Matteo Hessel, Rémi Munos, Olivier Pietquin

Despite significant advances in the field of deep Reinforcement Learning (RL), today's algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games.

Montezuma's Revenge Reinforcement Learning (RL)

Noisy Networks for Exploration

15 code implementations ICLR 2018 Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg

We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration.

Atari Games Efficient Exploration +2

Observational Learning by Reinforcement Learning

no code implementations20 Jun 2017 Diana Borsa, Bilal Piot, Rémi Munos, Olivier Pietquin

Observational learning is a type of learning that occurs as a function of observing, retaining and possibly replicating or imitating the behaviour of another agent.

reinforcement-learning Reinforcement Learning (RL)

The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning

no code implementations ICLR 2018 Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, Remi Munos

Our first contribution is a new policy evaluation algorithm called Distributional Retrace, which brings multi-step off-policy updates to the distributional reinforcement learning setting.

Atari Games Distributional Reinforcement Learning +1

Deep Q-learning from Demonstrations

5 code implementations12 Apr 2017 Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys

We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism.

Imitation Learning Q-Learning +1

End-to-end optimization of goal-driven and visually grounded dialogue systems

2 code implementations15 Mar 2017 Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, Olivier Pietquin

End-to-end design of dialogue systems has recently become a popular research topic thanks to powerful tools such as encoder-decoder architectures for sequence-to-sequence learning.

Decoder Dialogue Management +2

Is the Bellman residual a bad proxy?

no code implementations NeurIPS 2017 Matthieu Geist, Bilal Piot, Olivier Pietquin

This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual.

reinforcement-learning Reinforcement Learning (RL)

Difference of Convex Functions Programming Applied to Control with Expert Data

no code implementations3 Jun 2016 Bilal Piot, Matthieu Geist, Olivier Pietquin

This paper reports applications of Difference of Convex functions (DC) programming to Learning from Demonstrations (LfD) and Reinforcement Learning (RL) with expert data.

General Classification reinforcement-learning +1

Difference of Convex Functions Programming for Reinforcement Learning

no code implementations NeurIPS 2014 Bilal Piot, Matthieu Geist, Olivier Pietquin

Controlling this residual allows controlling the distance to the optimal action-value function, and we show that minimizing an empirical norm of the OBR is consistant in the Vapnik sense.

reinforcement-learning Reinforcement Learning (RL)

Inverse Reinforcement Learning through Structured Classification

no code implementations NeurIPS 2012 Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal.

Classification General Classification +2

Cannot find the paper you are looking for? You can Submit a new open access paper.