Search Results for author: Michal Valko

Found 96 papers, 29 papers with code

A General Theoretical Paradigm to Understand Learning from Human Preferences

1 code implementation18 Oct 2023 Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos

In particular, we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed directly in terms of pairwise preferences and therefore bypasses both approximations made in standard RLHF pipelines.
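
For orientation, the $\Psi$PO objective can be written compactly as follows (a sketch using notation assumed from the paper: $\rho$ the context distribution, $\mu$ a behaviour policy, $p^{*}$ the true preference probability, and $\tau$ a regularisation temperature):

$\max_{\pi}\ \mathbb{E}_{x\sim\rho,\; y\sim\pi(\cdot\mid x),\; y'\sim\mu(\cdot\mid x)}\big[\Psi\big(p^{*}(y \succ y' \mid x)\big)\big] \;-\; \tau\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big),$

and taking $\Psi$ to be the identity yields the IPO special case analysed in the paper.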

Zonotope hit-and-run for efficient sampling from projection DPPs

1 code implementation ICML 2017 Guillaume Gautier, Rémi Bardenet, Michal Valko

Previous theoretical results yield a fast mixing time of our chain when targeting a distribution that is close to a projection DPP, but not a DPP in general.

Point Processes Recommendation Systems

DPPy: Sampling DPPs with Python

2 code implementations19 Sep 2018 Guillaume Gautier, Guillermo Polito, Rémi Bardenet, Michal Valko

Determinantal point processes (DPPs) are specific probability distributions over clouds of points that are used as models and computational tools across physics, probability, statistics, and more recently machine learning.
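
To illustrate the kind of routine such a toolbox packages, here is a minimal NumPy sketch of the classical spectral sampler for a finite L-ensemble (Hough et al., 2006). It is a generic reference implementation, not the DPPy API itself:

```python
import numpy as np

def sample_dpp(L, rng=None):
    """Exact sampling from the DPP defined by a PSD L-ensemble matrix L.

    Spectral two-phase scheme: (1) keep eigenvector i with probability
    eigval_i / (1 + eigval_i); (2) sample one item per kept eigenvector
    from the induced projection DPP.
    """
    rng = np.random.default_rng(rng)
    eigvals, eigvecs = np.linalg.eigh(L)
    keep = rng.random(len(eigvals)) < eigvals / (1.0 + eigvals)
    V = eigvecs[:, keep]
    sample = []
    while V.shape[1] > 0:
        # Marginal probability of each item given the remaining subspace.
        probs = np.sum(V ** 2, axis=1) / V.shape[1]
        i = rng.choice(len(probs), p=probs)
        sample.append(i)
        # Project the remaining eigenvectors onto the subspace orthogonal to e_i.
        j = np.argmax(np.abs(V[i, :]))
        V = V - np.outer(V[:, j] / V[i, j], V[i, :])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(sample)

# Example: sample from a random 10-item L-ensemble.
X = np.random.randn(10, 3)
print(sample_dpp(X @ X.T))
```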

BIG-bench Machine Learning Point Processes

Exact sampling of determinantal point processes with sublinear time preprocessing

2 code implementations NeurIPS 2019 Michał Dereziński, Daniele Calandriello, Michal Valko

For this purpose, we propose a new algorithm which, given access to $\mathbf{L}$, samples exactly from a determinantal point process while satisfying the following two properties: (1) its preprocessing cost is $n \cdot \text{poly}(k)$, i.e., sublinear in the size of $\mathbf{L}$, and (2) its sampling cost is $\text{poly}(k)$, i.e., independent of the size of $\mathbf{L}$.

Point Processes

On two ways to use determinantal point processes for Monte Carlo integration

1 code implementation NeurIPS 2019 Guillaume Gautier, Rémi Bardenet, Michal Valko

In the absence of DPP machinery to derive an efficient sampler and analyze their estimator, the idea of Monte Carlo integration with DPPs was stored in the cellar of numerical integration.
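
As a brief worked example (using standard DPP notation, not necessarily that of the paper): if $X$ is drawn from a projection DPP with kernel $K$ and reference measure $\mu$, then

$\hat{I} \;=\; \sum_{x \in X} \frac{f(x)}{K(x, x)}$

is an unbiased estimator of $\int f \,\mathrm{d}\mu$, since the first-order intensity of the DPP is $K(x, x)\,\mu(\mathrm{d}x)$; the paper compares two such DPP-based estimators and the samplers they require.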

Numerical Integration Point Processes

Sampling from a k-DPP without looking at all items

1 code implementation NeurIPS 2020 Daniele Calandriello, Michal Derezinski, Michal Valko

Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small diverse subset out of a large collection of items, with applications in summarization, recommendation, stochastic optimization, experimental design and more.

Experimental Design Point Processes +1

Large-Scale Representation Learning on Graphs via Bootstrapping

3 code implementations ICLR 2022 Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L. Dyer, Rémi Munos, Petar Veličković, Michal Valko

To address these challenges, we introduce Bootstrapped Graph Latents (BGRL) - a graph representation learning method that learns by predicting alternative augmentations of the input.
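
The bootstrapping step can be sketched in a few lines of PyTorch. This is a hedged toy illustration of the BYOL-style objective that BGRL builds on, with linear placeholder encoders acting on node features instead of the GNN encoders and graph augmentations used in the paper (only one prediction direction is shown):

```python
import torch
import torch.nn.functional as F

dim = 32
encoder_online = torch.nn.Linear(dim, dim)
encoder_target = torch.nn.Linear(dim, dim)   # slowly-moving EMA copy of the online encoder
predictor = torch.nn.Linear(dim, dim)

# Two augmented "views" of the same nodes (placeholders for real graph augmentations).
x1, x2 = torch.randn(100, dim), torch.randn(100, dim)

# Predict the target embedding of view 2 from the online embedding of view 1.
h_online = predictor(encoder_online(x1))
with torch.no_grad():
    h_target = encoder_target(x2)

# Negative cosine similarity; no negative samples are needed.
loss = -F.cosine_similarity(h_online, h_target, dim=-1).mean()
loss.backward()

# EMA update of the target encoder parameters.
tau = 0.99
with torch.no_grad():
    for p_t, p_o in zip(encoder_target.parameters(), encoder_online.parameters()):
        p_t.mul_(tau).add_((1 - tau) * p_o)
```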

Contrastive Learning Graph Representation Learning +1

Online Influence Maximization under Independent Cascade Model with Semi-Bandit Feedback

1 code implementation NeurIPS 2017 Zheng Wen, Branislav Kveton, Michal Valko, Sharan Vaswani

Specifically, we aim to learn the set of "best influencers" in a social network online while repeatedly interacting with it.

Drop, Swap, and Generate: A Self-Supervised Approach for Generating Neural Activity

1 code implementation NeurIPS 2021 Ran Liu, Mehdi Azabou, Max Dabagia, Chi-Heng Lin, Mohammad Gheshlaghi Azar, Keith B. Hengen, Michal Valko, Eva L. Dyer

Our approach combines a generative modeling framework with an instance-specific alignment loss that tries to maximize the representational similarity between transformed views of the input (brain state).

Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret

1 code implementation13 Mar 2019 Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco

Moreover, we show that our procedure selects at most $\tilde{O}(d_{eff})$ points, where $d_{eff}$ is the effective dimension of the explored space, which is typically much smaller than both $d$ and $t$.

Gaussian Processes

Compressing the Input for CNNs with the First-Order Scattering Transform

1 code implementation ECCV 2018 Edouard Oyallon, Eugene Belilovsky, Sergey Zagoruyko, Michal Valko

We study the first-order scattering transform as a candidate for reducing the signal processed by a convolutional neural network (CNN).

General Classification Translation

Adapting to game trees in zero-sum imperfect information games

1 code implementation23 Dec 2022 Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

Imperfect information games (IIG) are games in which each player only partially observes the current game state.

Kernel-Based Reinforcement Learning: A Finite-Time Analysis

1 code implementation12 Apr 2020 Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric.

reinforcement-learning Reinforcement Learning (RL)

UCB Momentum Q-learning: Correcting the bias without forgetting

1 code implementation1 Mar 2021 Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, Michal Valko

We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algorithm for reinforcement learning in tabular, possibly stage-dependent, episodic Markov decision processes.

Q-Learning

Multiagent Evaluation under Incomplete Information

1 code implementation NeurIPS 2019 Mark Rowland, Shayegan Omidshafiei, Karl Tuyls, Julien Perolat, Michal Valko, Georgios Piliouras, Remi Munos

This paper investigates the evaluation of learned multiagent strategies in the incomplete information setting, which plays a critical role in ranking and training of agents.

Planning in entropy-regularized Markov decision processes and games

1 code implementation NeurIPS 2019 Jean-bastien Grill, Omar Darwiche Domingues, Pierre Menard, Remi Munos, Michal Valko

We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment.

Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees

1 code implementation28 Sep 2022 Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Mark Rowland, Michal Valko, Pierre Menard

We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon $H$ with $S$ states, and $A$ actions.

reinforcement-learning Reinforcement Learning (RL)

Fast Rates for Maximum Entropy Exploration

1 code implementation14 Mar 2023 Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Yunhao Tang, Michal Valko, Pierre Menard

Finally, we apply developed regularization techniques to reduce sample complexity of visitation entropy maximization to $\widetilde{\mathcal{O}}(H^2SA/\varepsilon^2)$, yielding a statistical separation between maximum entropy exploration and reward-free exploration.

Reinforcement Learning (RL)

Finding the bandit in a graph: Sequential search-and-stop

no code implementations6 Jun 2018 Pierre Perrault, Vianney Perchet, Michal Valko

We consider the problem where an agent wants to find a hidden object that is randomly located in some vertex of a directed acyclic graph (DAG) according to a fixed but possibly unknown distribution.

Multi-Armed Bandits

Distributed Adaptive Sampling for Kernel Matrix Approximation

no code implementations27 Mar 2018 Daniele Calandriello, Alessandro Lazaric, Michal Valko

In this paper, we introduce SQUEAK, a new algorithm for kernel approximation based on RLS sampling that sequentially processes the dataset, storing a dictionary which creates accurate kernel matrix approximations with a number of points that only depends on the effective dimension $d_{eff}(\gamma)$ of the dataset.
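
For reference, the ridge leverage scores (RLS) and the effective dimension that govern this dictionary size can be computed naively as below. This is a NumPy sketch of the standard definitions (up to the exact scaling of the regularization used in the paper), not of SQUEAK itself, which never forms the full kernel matrix:

```python
import numpy as np

def ridge_leverage_scores(K, gamma):
    """Ridge leverage scores tau_i(gamma) = [K (K + gamma I)^{-1}]_{ii}
    and effective dimension d_eff(gamma) = sum_i tau_i(gamma).

    Naive O(n^3) reference computation for an n x n kernel matrix K.
    """
    n = K.shape[0]
    taus = np.diag(np.linalg.solve(K + gamma * np.eye(n), K))
    return taus, taus.sum()

# Example on a small RBF kernel matrix.
X = np.random.randn(50, 2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
taus, d_eff = ridge_leverage_scores(K, gamma=0.1)
print(d_eff)
```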

Clustering

Second-Order Kernel Online Convex Optimization with Adaptive Sketching

no code implementations ICML 2017 Daniele Calandriello, Alessandro Lazaric, Michal Valko

First-order KOCO methods such as functional gradient descent require only $\mathcal{O}(t)$ time and space per iteration, and, when the only information on the losses is their convexity, achieve a minimax optimal $\mathcal{O}(\sqrt{T})$ regret.

Second-order methods

Analysis of Kelner and Levin graph sparsification algorithm for a streaming setting

no code implementations13 Sep 2016 Daniele Calandriello, Alessandro Lazaric, Michal Valko

We derive a new proof to show that the incremental resparsification algorithm proposed by Kelner and Levin (2013) produces a spectral sparsifier with high probability.

Incremental Spectral Sparsification for Large-Scale Graph-Based Semi-Supervised Learning

no code implementations21 Jan 2016 Daniele Calandriello, Alessandro Lazaric, Michal Valko, Ioannis Koutis

While the harmonic function solution performs well in many semi-supervised learning (SSL) tasks, it is known to scale poorly with the number of samples.

Quantization

Cheap Bandits

no code implementations15 Jun 2015 Manjesh Kumar Hanawal, Venkatesh Saligrama, Michal Valko, Rémi Munos

We consider stochastic sequential learning problems where the learner can observe the average reward of several actions.

Simple regret for infinitely many armed bandits

no code implementations18 May 2015 Alexandra Carpentier, Michal Valko

As in the cumulative regret setting of infinitely many armed bandits, the rate of the simple regret will depend on a parameter $\beta$ characterizing the distribution of the near-optimal arms.

Learning to Act Greedily: Polymatroid Semi-Bandits

no code implementations30 May 2014 Branislav Kveton, Zheng Wen, Azin Ashkan, Michal Valko

Many important optimization problems, such as the minimum spanning tree and minimum-cost flow, can be solved optimally by a greedy method.

Finite-Time Analysis of Kernelised Contextual Bandits

no code implementations26 Sep 2013 Michal Valko, Nathaniel Korda, Remi Munos, Ilias Flaounas, Nelo Cristianini

For contextual bandits, the related algorithm GP-UCB turns out to be a special case of our algorithm, and our finite-time analysis improves the regret bound of GP-UCB for the agnostic case, both in the terms of the kernel-dependent quantity and the RKHS norm of the reward function.
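
To make the connection concrete, a generic kernelised UCB index built from kernel ridge regression looks roughly as follows. This is a sketch only; the exact confidence width and constants of the paper's KernelUCB (and of GP-UCB) differ:

```python
import numpy as np

def kernel_ucb_index(K, k_x, k_xx, y, lam=1.0, beta=1.0):
    """GP-UCB-style index mean + beta * std from kernel ridge regression.

    K    : t x t kernel matrix of past contexts
    k_x  : length-t vector of kernel evaluations between x and past contexts
    k_xx : scalar k(x, x)
    y    : length-t vector of observed rewards
    """
    A = K + lam * np.eye(K.shape[0])
    mean = k_x @ np.linalg.solve(A, y)
    var = k_xx - k_x @ np.linalg.solve(A, k_x)
    return mean + beta * np.sqrt(max(var, 0.0))

# Toy usage with an RBF kernel on 1-d contexts.
rng = np.random.default_rng(0)
ctx = rng.uniform(-1, 1, size=20)
rbf = lambda a, b: np.exp(-(a - b) ** 2 / 0.1)
K = rbf(ctx[:, None], ctx[None, :])
y = np.sin(3 * ctx) + 0.1 * rng.standard_normal(20)
x_new = 0.3
print(kernel_ucb_index(K, rbf(ctx, x_new), rbf(x_new, x_new), y))
```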

Multi-Armed Bandits

A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption

no code implementations1 Oct 2018 Peter L. Bartlett, Victor Gabillon, Michal Valko

The difficulty of optimization is measured in terms of 1) the amount of noise $b$ in the function evaluations and 2) the local smoothness, $d$, of the function.

Efficient Second-Order Online Kernel Learning with Adaptive Embedding

no code implementations NeurIPS 2017 Daniele Calandriello, Alessandro Lazaric, Michal Valko

The embedded space is continuously updated to guarantee that the embedding remains accurate, and we show that the per-step cost only grows with the effective dimension of the problem and not with $T$.

Second-order methods

Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning

no code implementations NeurIPS 2016 Jean-bastien Grill, Michal Valko, Remi Munos

We study the sampling-based planning problem in Markov decision processes (MDPs) that we can access only through a generative model, usually referred to as Monte-Carlo planning.

Black-box optimization of noisy functions with unknown smoothness

no code implementations NeurIPS 2015 Jean-bastien Grill, Michal Valko, Remi Munos

We study the problem of black-box optimization of a function $f$ of any dimension, given function evaluations perturbed by noise.

Efficient learning by implicit exploration in bandit problems with side observations

no code implementations NeurIPS 2014 Tomáš Kocák, Gergely Neu, Michal Valko, Remi Munos

As the predictions of our first algorithm cannot be always computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism.

Combinatorial Optimization

Extreme bandits

no code implementations NeurIPS 2014 Alexandra Carpentier, Michal Valko

In many areas of medicine, security, and life sciences, we want to allocate limited resources to different sources in order to detect extreme values.

Network Intrusion Detection

Improved large-scale graph learning through ridge spectral sparsification

no code implementations ICML 2018 Daniele Calandriello, Alessandro Lazaric, Ioannis Koutis, Michal Valko

By constructing a spectrally-similar graph, we are able to bound the error induced by the sparsification for a variety of downstream tasks (e.g., SSL).

Graph Learning

Optimistic optimization of a Brownian

no code implementations NeurIPS 2018 Jean-bastien Grill, Michal Valko, Rémi Munos

Given $W$, our goal is to return an $\epsilon$-approximation of its maximum using the smallest possible number of function evaluations, the sample complexity of the algorithm.

Exploiting Structure of Uncertainty for Efficient Matroid Semi-Bandits

no code implementations11 Feb 2019 Pierre Perrault, Vianney Perchet, Michal Valko

We improve the efficiency of algorithms for stochastic combinatorial semi-bandits.

Online A-Optimal Design and Active Linear Regression

no code implementations20 Jun 2019 Xavier Fontaine, Pierre Perrault, Michal Valko, Vianney Perchet

By trying to minimize the $\ell^2$-loss $\mathbb{E} [\lVert\hat{\beta}-\beta^{\star}\rVert^2]$ the decision maker is actually minimizing the trace of the covariance matrix of the problem, which then corresponds to online A-optimal design.
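
The underlying identity is standard (a brief sketch, assuming an unbiased estimator and homoscedastic noise of variance $\sigma^2$):

$\mathbb{E}\big[\lVert\hat{\beta}-\beta^{\star}\rVert^2\big] \;=\; \operatorname{tr}\big(\mathrm{Cov}(\hat{\beta})\big), \qquad \mathrm{Cov}\big(\hat{\beta}_{\mathrm{OLS}}\big) \;=\; \sigma^2\,(X^{\top}X)^{-1},$

so selecting which covariates to sample in order to minimise $\operatorname{tr}\big((X^{\top}X)^{-1}\big)$ is exactly the A-optimality criterion from experimental design.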

regression

Derivative-Free & Order-Robust Optimisation

no code implementations9 Oct 2019 Victor Gabillon, Rasul Tutunov, Michal Valko, Haitham Bou Ammar

In this paper, we formalise order-robust optimisation as an instance of online learning minimising simple regret, and propose Vroom, a zeroth-order optimisation algorithm capable of achieving vanishing regret in non-stationary environments, while recovering favorable rates under stochastic reward-generating processes.

Fixed-Confidence Guarantees for Bayesian Best-Arm Identification

no code implementations24 Oct 2019 Xuedong Shang, Rianne de Heide, Emilie Kaufmann, Pierre Ménard, Michal Valko

We investigate and provide new insights on the sampling rule called Top-Two Thompson Sampling (TTTS).

Thompson Sampling

No-Regret Exploration in Goal-Oriented Reinforcement Learning

no code implementations ICML 2020 Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, Alessandro Lazaric

Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost.

Atari Games reinforcement-learning +1

Improved Sleeping Bandits with Stochastic Action Sets and Adversarial Rewards

no code implementations14 Apr 2020 Aadirupa Saha, Pierre Gaillard, Michal Valko

We then study the most general version of the problem where at each round available sets are generated from some unknown arbitrary distribution (i.e., without the independence assumption) and propose an efficient algorithm with $O(\sqrt{2^K T})$ regret guarantee.

Planning in Markov Decision Processes with Gap-Dependent Sample Complexity

no code implementations NeurIPS 2020 Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, Michal Valko

We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process in which transitions have a finite support.

Statistical Efficiency of Thompson Sampling for Combinatorial Semi-Bandits

no code implementations NeurIPS 2020 Pierre Perrault, Etienne Boursier, Vianney Perchet, Michal Valko

In CMAB, the question of the existence of an efficient policy with an optimal asymptotic regret (up to a factor poly-logarithmic in the action size) is still open for many families of distributions, including mutually independent outcomes, and more generally the multivariate sub-Gaussian family.

Thompson Sampling

Adaptive Reward-Free Exploration

no code implementations11 Jun 2020 Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko

Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel.

Stochastic bandits with arm-dependent delays

no code implementations ICML 2020 Anne Gael Manegueu, Claire Vernade, Alexandra Carpentier, Michal Valko

Significant work has been recently dedicated to the stochastic delayed bandit setting because of its relevance in applications.

Sampling from a $k$-DPP without looking at all items

no code implementations30 Jun 2020 Daniele Calandriello, Michał Dereziński, Michal Valko

Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small diverse subset out of a large collection of items, with applications in summarization, stochastic optimization, active learning and more.

Active Learning Point Processes +1

Gamification of Pure Exploration for Linear Bandits

no code implementations ICML 2020 Rémy Degenne, Pierre Ménard, Xuedong Shang, Michal Valko

We investigate an active pure-exploration setting, that includes best-arm identification, in the context of linear stochastic bandits.

Experimental Design

A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

no code implementations9 Jul 2020 Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric.

reinforcement-learning Reinforcement Learning (RL)

A Provably Efficient Sample Collection Strategy for Reinforcement Learning

no code implementations NeurIPS 2021 Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior.

reinforcement-learning Reinforcement Learning (RL)

Budgeted Online Influence Maximization

no code implementations ICML 2020 Pierre Perrault, Zheng Wen, Michal Valko, Jennifer Healey

We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set.


Improved Sleeping Bandits with Stochastic Action Sets and Adversarial Rewards

no code implementations ICML 2020 Aadirupa Saha, Pierre Gaillard, Michal Valko

The best existing efficient (i.e., polynomial-time) algorithms for this problem only guarantee a $O(T^{2/3})$ upper-bound on the regret.

Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited

no code implementations7 Oct 2020 Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, Michal Valko

In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode.

reinforcement-learning Reinforcement Learning (RL)

On the Approximation Relationship between Optimizing Ratio of Submodular (RS) and Difference of Submodular (DS) Functions

no code implementations5 Jan 2021 Pierre Perrault, Jennifer Healey, Zheng Wen, Michal Valko

We demonstrate that from an algorithm guaranteeing an approximation factor for the ratio of submodular (RS) optimization problem, we can build another algorithm having a different kind of approximation guarantee -- weaker than the classical one -- for the difference of submodular (DS) optimization problem, and vice versa.

Data Structures and Algorithms

Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning

no code implementations27 Feb 2021 Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel

These results indicate that Peng's Q($\lambda$), which was thought to be unsafe, is a theoretically-sound and practically effective algorithm.

Continuous Control reinforcement-learning +1

Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

no code implementations NeurIPS 2021 Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric

We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.

Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall

no code implementations11 Jun 2021 Tadashi Kozuno, Pierre Ménard, Rémi Munos, Michal Valko

We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play.

Taylor Expansion of Discount Factors

no code implementations11 Jun 2021 Yunhao Tang, Mark Rowland, Rémi Munos, Michal Valko

In practical reinforcement learning (RL), the discount factor used for estimating value functions often differs from that used for defining the evaluation objective.

reinforcement-learning Reinforcement Learning (RL)

Learning in two-player zero-sum partially observable Markov games with perfect recall

no code implementations NeurIPS 2021 Tadashi Kozuno, Pierre Ménard, Remi Munos, Michal Valko

We study the problem of learning a Nash equilibrium (NE) in an extensive game with imperfect information (EGII) through self-play.

Adaptive Multi-Goal Exploration

no code implementations23 Nov 2021 Jean Tarbouriech, Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Michal Valko, Alessandro Lazaric

We introduce a generic strategy for provably efficient multi-goal exploration.

Marginalized Operators for Off-policy Reinforcement Learning

no code implementations30 Mar 2022 Yunhao Tang, Mark Rowland, Rémi Munos, Michal Valko

We show that the estimates for marginalized operators can be computed in a scalable way, which also generalizes prior results on marginalized importance sampling as special cases.

Off-policy evaluation reinforcement-learning

From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses

no code implementations16 May 2022 Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, Pierre Menard

We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision processes: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits.

Multi-Armed Bandits

Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments

no code implementations18 Nov 2022 Daniel Jarrett, Corentin Tallec, Florent Altché, Thomas Mesnard, Rémi Munos, Michal Valko

In this work, we study a natural solution derived from structural causal models of the world: Our key idea is to learn representations of the future that capture precisely the unpredictable aspects of each outcome -- which we use as additional input for predictions, such that intrinsic rewards only reflect the predictable aspects of world dynamics.

Montezuma's Revenge

Unlocking the Power of Representations in Long-term Novelty-based Exploration

no code implementations2 May 2023 Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, Bilal Piot

We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method for novelty-based exploration that estimates visitation counts for clusters of states based on their similarity in a chosen embedding space.

Atari Games Clustering +1

VA-learning as a more efficient alternative to Q-learning

no code implementations29 May 2023 Yunhao Tang, Rémi Munos, Mark Rowland, Michal Valko

In reinforcement learning, the advantage function is critical for policy improvement, but is often extracted from a learned Q-function.

Q-Learning

DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm

no code implementations29 May 2023 Yunhao Tang, Tadashi Kozuno, Mark Rowland, Anna Harutyunyan, Rémi Munos, Bernardo Ávila Pires, Michal Valko

Multi-step learning applies lookahead over multiple time steps and has proved valuable in policy evaluation settings.

Local and adaptive mirror descents in extensive-form games

no code implementations1 Sep 2023 Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback.

Demonstration-Regularized RL

no code implementations26 Oct 2023 Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Menard

In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning.

reinforcement-learning Reinforcement Learning (RL)

Generalized Preference Optimization: A Unified Approach to Offline Alignment

no code implementations8 Feb 2024 Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, Bilal Piot

Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices.

Human Alignment of Large Language Models through Online Preference Optimisation

no code implementations13 Mar 2024 Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot

Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly to the general Nash-MD algorithm.
