1 code implementation • 26 Oct 2024 • Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, Andrea Zanette
The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model's responses are in accordance with human preferences.
1 code implementation • 29 Feb 2024 • Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, Aviral Kumar
In this paper, we develop a framework for building multi-turn RL algorithms for fine-tuning LLMs that preserves the flexibility of existing single-turn RL methods for LLMs (e.g., proximal policy optimization), while accommodating multiple turns, long horizons, and delayed rewards effectively.
no code implementations • 24 Feb 2024 • Ruiqi Zhang, Yuexiang Zhai, Andrea Zanette
Surprisingly, in this work, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one.
no code implementations • 10 Nov 2022 • Andrea Zanette
Model-free algorithms for reinforcement learning typically require a condition called Bellman completeness in order to successfully operate off-policy with function approximation, unless additional conditions are met.
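As a rough illustration (standard notation, not quoted from the abstract): Bellman completeness asks that the chosen function class $\mathcal{F}$ be closed under the Bellman backup, i.e. $\mathcal{T} f \in \mathcal{F}$ for every $f \in \mathcal{F}$, where $\mathcal{T}$ denotes the Bellman operator.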
no code implementations • 1 Jun 2022 • Andrea Zanette, Martin J. Wainwright
Such instability can be observed even with linear function approximation.
no code implementations • 24 Mar 2022 • Andrea Zanette, Martin J. Wainwright
We propose and analyze a reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions.
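As a hedged sketch of this principle (notation assumed, not taken from the paper): rather than requiring $Q = \mathcal{T} Q$ everywhere, one enforces the weak form $\mathbb{E}_{\mu}[(Q - \mathcal{T} Q)(s,a)\, f(s,a)] = 0$ for every test function $f$ in a user-chosen class $\mathcal{F}$, so the Bellman residual is only required to be orthogonal to $\mathcal{F}$ under the data distribution $\mu$.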
no code implementations • NeurIPS 2021 • Andrea Zanette, Martin J. Wainwright, Emma Brunskill
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
no code implementations • NeurIPS 2021 • Andrea Zanette, Kefan Dong, Jonathan Lee, Emma Brunskill
In the stochastic linear contextual bandit setting there exist several minimax procedures for exploration with policies that are reactive to the data being acquired.
no code implementations • 24 Mar 2021 • Andrea Zanette, Ching-An Cheng, Alekh Agarwal
Policy optimization methods are popular reinforcement learning algorithms because their incremental and on-policy nature makes them more stable than their value-based counterparts.
no code implementations • 14 Dec 2020 • Andrea Zanette
Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration.
no code implementations • NeurIPS 2020 • Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, Emma Brunskill
There has been growing progress on theoretical analyses for provably efficient learning in MDPs with linear function approximation, but much of the existing work has made strong assumptions to enable exploration by conventional exploration frameworks.
no code implementations • ICML 2020 • Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill
This has two important consequences: 1) it shows that exploration is possible using only \emph{batch assumptions}, with an algorithm that achieves the optimal statistical rate for the setting we consider, which is more general than prior work on low-rank MDPs; and 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting.
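For reference, a standard definition (assumed here rather than quoted): the inherent Bellman error of a function class $\mathcal{Q}$ measures how far Bellman backups fall outside the class, e.g. $\sup_{Q \in \mathcal{Q}} \inf_{Q' \in \mathcal{Q}} \| Q' - \mathcal{T} Q \|$; it vanishes exactly when the class is closed under the Bellman operator.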
no code implementations • NeurIPS 2019 • Andrea Zanette, Mykel J. Kochenderfer, Emma Brunskill
This paper focuses on the problem of computing an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP) provided that we can access the reward and transition function through a generative model.
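Here, a policy $\pi$ is $\epsilon$-optimal if its value is within $\epsilon$ of the optimum, i.e. $V^{\pi}(s) \geq V^{\star}(s) - \epsilon$ for all states $s$ (an illustrative formalization; the paper's precise criterion may differ, e.g. holding only at a fixed start state).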
no code implementations • NeurIPS 2019 • Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, Emma Brunskill
We prove that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and our method achieves a polynomial sample complexity bound in the horizon and the number of anchor points.
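Concretely (illustrative notation, not quoted from the abstract): the assumption is that for any state there exist weights $\lambda_1, \dots, \lambda_K \geq 0$ with $\sum_k \lambda_k = 1$ such that $\phi(s) = \sum_k \lambda_k \phi(x_k)$, where $x_1, \dots, x_K$ are the anchor points; under this condition, the guarantees stated above scale polynomially in $K$ and the horizon rather than exponentially in the horizon.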
no code implementations • ICML 2018 • Andrea Zanette, Emma Brunskill
In order to make good decisions under uncertainty, an agent must learn from observations.
2 code implementations • 1 Nov 2019 • Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, Alessandro Lazaric
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL).
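A common way to measure performance in this setting (standard notation, assumed here) is the cumulative regret over $K$ episodes, $\mathrm{Regret}(K) = \sum_{k=1}^{K} \big( V^{\star}_1(s_{1,k}) - V^{\pi_k}_1(s_{1,k}) \big)$, i.e. the total value lost relative to an optimal policy by the policies $\pi_k$ that the algorithm deploys.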
no code implementations • 1 Jan 2019 • Andrea Zanette, Emma Brunskill
Strong worst-case performance bounds for episodic reinforcement learning exist, but in practice RL algorithms fortunately perform much better than such bounds would predict.
no code implementations • 25 Nov 2018 • Andrea Zanette, Junzi Zhang, Mykel J. Kochenderfer
This paper focuses on the problem of determining as large a region as possible where a function exceeds a given threshold with high probability.
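As a hedged formalization (symbols assumed, not from the abstract): given a threshold $\tau$ and a confidence level $1 - \delta$, the goal is to return as large a region $\hat{R}$ as possible such that $f(x) \geq \tau$ holds for all $x \in \hat{R}$ with probability at least $1 - \delta$.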