Search Results for author: Andrea Zanette

Found 17 papers, 2 papers with code

ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

1 code implementation29 Feb 2024 Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, Aviral Kumar

In this paper, we develop a framework for building multi-turn RL algorithms for fine-tuning LLMs that preserves the flexibility of existing single-turn RL methods for LLMs (e.g., proximal policy optimization) while effectively accommodating multiple turns, long horizons, and delayed rewards.
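
A minimal, hypothetical sketch of the hierarchical idea summarized above, assuming an utterance-level (high-level) critic trained by temporal-difference updates across turns and a token-level (low-level) policy step that reuses a standard single-turn objective. All function names and components below are toy stand-ins for illustration, not the ArCHer implementation itself.

GAMMA = 0.95

def policy_generate(history):
    # Toy stand-in for an LLM policy that emits one utterance (high-level action) per turn.
    return "reply-%d" % len(history)

def environment_step(history, utterance):
    # Toy environment: reward arrives only when the dialogue ends (delayed reward).
    done = len(history) + 1 >= 4
    reward = 1.0 if done else 0.0
    return reward, done

critic = {}  # utterance-level value estimates keyed by turn index (stand-in for a learned critic network)

def td_update(turn, reward, next_turn, done, lr=0.1):
    # High-level temporal-difference update over turns (utterances), not tokens.
    target = reward + (0.0 if done else GAMMA * critic.get(next_turn, 0.0))
    critic[turn] = critic.get(turn, 0.0) + lr * (target - critic.get(turn, 0.0))

def token_level_update(utterance, advantage):
    # Placeholder for a standard single-turn policy-gradient step (e.g., PPO-style)
    # applied to the tokens of one utterance, weighted by the high-level advantage.
    pass

for episode in range(10):
    history, done, turn = [], False, 0
    while not done:
        utterance = policy_generate(history)
        reward, done = environment_step(history, utterance)
        td_update(turn, reward, turn + 1, done)
        advantage = (reward + (0.0 if done else GAMMA * critic.get(turn + 1, 0.0))
                     - critic.get(turn, 0.0))
        token_level_update(utterance, advantage)
        history.append(utterance)
        turn += 1

The point of the split is that the high-level critic sees one transition per utterance, so credit assignment over a long dialogue does not have to be carried out token by token.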

Language Modelling, Reinforcement Learning (RL)

Is Offline Decision Making Possible with Only Few Samples? Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement

no code implementations • 24 Feb 2024 • Ruiqi Zhang, Yuexiang Zhai, Andrea Zanette

Surprisingly, in this work, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one.
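
As a hedged illustration of the trust-region idea (notation assumed here for illustration, not taken from the paper): with a reference policy $\pi_b$ estimated from the few available samples, one can search for

\[
\hat{\pi} \in \arg\max_{\pi \,:\, D(\pi, \pi_b) \le \delta} \ \widehat{V}(\pi) - \beta\, U(\pi),
\]

i.e., maximize a pessimistic (lower-confidence) value estimate while staying within a trust region of radius $\delta$ around $\pi_b$, so that the value estimate remains reliable despite the scarce data.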

Decision Making, Multi-Armed Bandits

When is Realizability Sufficient for Off-Policy Reinforcement Learning?

no code implementations • 10 Nov 2022 • Andrea Zanette

Model-free algorithms for reinforcement learning typically require a condition called Bellman completeness in order to successfully operate off-policy with function approximation, unless additional conditions are met.
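
For context, the two conditions contrasted here can be stated for a function class $\mathcal{F}$ and Bellman operator $\mathcal{T}^\pi$ as follows (standard definitions, not specific to this paper):

\[
\text{realizability:}\quad Q^\pi \in \mathcal{F}, \qquad\qquad \text{Bellman completeness:}\quad \mathcal{T}^\pi f \in \mathcal{F} \ \text{ for all } f \in \mathcal{F}.
\]

Bellman completeness is the stronger requirement, since it must hold for every element of the class rather than only for the target value function.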

Reinforcement Learning (RL)

Bellman Residual Orthogonalization for Offline Reinforcement Learning

no code implementations • 24 Mar 2022 • Andrea Zanette, Martin J. Wainwright

We propose and analyze a reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions.
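
A schematic version of this principle (symbols assumed for illustration): instead of requiring the Bellman residual to vanish pointwise, one requires it to be orthogonal to a user-chosen class of test functions $\mathcal{G}$ under the data distribution $\mu$,

\[
\mathbb{E}_{\mu}\big[\, g(s,a)\,\big( Q(s,a) - r - \gamma\, Q(s', \pi(s')) \big) \big] = 0 \qquad \text{for all } g \in \mathcal{G}.
\]

The richer the test class, the closer this weak formulation comes to enforcing the Bellman equations exactly.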

Offline RL, Off-policy evaluation +1

Design of Experiments for Stochastic Contextual Linear Bandits

no code implementations • NeurIPS 2021 • Andrea Zanette, Kefan Dong, Jonathan Lee, Emma Brunskill

In the stochastic linear contextual bandit setting, there exist several minimax procedures for exploration with policies that are reactive to the data being acquired.

Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation

no code implementations • 24 Mar 2021 • Andrea Zanette, Ching-An Cheng, Alekh Agarwal

Policy optimization methods are popular reinforcement learning algorithms because their incremental and on-policy nature makes them more stable than their value-based counterparts.

Reinforcement Learning (RL)

Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

no code implementations • 14 Dec 2020 • Andrea Zanette

Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration.

Reinforcement Learning (RL)

Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration

no code implementations • NeurIPS 2020 • Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, Emma Brunskill

There has been growing progress on theoretical analyses for provably efficient learning in MDPs with linear function approximation, but much of the existing work has made strong assumptions to enable exploration by conventional exploration frameworks.

Learning Near Optimal Policies with Low Inherent Bellman Error

no code implementations • ICML 2020 • Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill

This has two important consequences: 1) it shows that exploration is possible using only \emph{batch assumptions}, with an algorithm that achieves the optimal statistical rate for the setting we consider, which is more general than prior work on low-rank MDPs; 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting.
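
The inherent Bellman error mentioned here is, in its standard formulation, a measure of how far the function class $\mathcal{Q}$ is from being closed under the Bellman operator $\mathcal{T}$ (notation assumed for illustration):

\[
\mathcal{I} \;=\; \sup_{Q \in \mathcal{Q}} \; \inf_{Q' \in \mathcal{Q}} \; \big\| Q' - \mathcal{T} Q \big\|,
\]

so $\mathcal{I} = 0$ recovers Bellman completeness, and the claim above is that this misspecification term enters the guarantees only after being inflated by a $\sqrt{d_t}$ factor.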

Reinforcement Learning (RL)

Limiting Extrapolation in Linear Approximate Value Iteration

no code implementations • NeurIPS 2019 • Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, Emma Brunskill

We prove that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and our method achieves a polynomial sample complexity bound in the horizon and the number of anchor points.
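
The anchor-point condition stated above can be written as follows (notation assumed for illustration): for a feature map $\phi$ and anchor state-action pairs $(s_1,a_1),\dots,(s_K,a_K)$, every $(s,a)$ satisfies

\[
\phi(s,a) \;=\; \sum_{k=1}^{K} \lambda_k(s,a)\, \phi(s_k,a_k), \qquad \lambda_k(s,a) \ge 0, \qquad \sum_{k=1}^{K} \lambda_k(s,a) = 1,
\]

i.e., the features lie in the convex hull of the anchor features, which prevents approximate value iteration from extrapolating outside the region covered by the anchors.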

Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model

no code implementations • NeurIPS 2019 • Andrea Zanette, Mykel J. Kochenderfer, Emma Brunskill

This paper focuses on the problem of computing an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP) provided that we can access the reward and transition function through a generative model.
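
For reference, a policy $\pi$ is $\epsilon$-optimal in a discounted MDP when its value is within $\epsilon$ of the optimal value in every state (standard definition):

\[
V^{\pi}(s) \;\ge\; V^{*}(s) - \epsilon \qquad \text{for all } s,
\]

and the generative-model assumption means the algorithm may query samples $s' \sim P(\cdot \mid s, a)$ and rewards $r(s,a)$ at any state-action pair of its choosing.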

Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

no code implementations • 1 Jan 2019 • Andrea Zanette, Emma Brunskill

Strong worst-case performance bounds exist for episodic reinforcement learning, but in practice RL algorithms fortunately perform much better than such bounds would predict.

Learning Theory, Reinforcement Learning (RL)

Robust Super-Level Set Estimation using Gaussian Processes

no code implementations • 25 Nov 2018 • Andrea Zanette, Junzi Zhang, Mykel J. Kochenderfer

This paper focuses on the problem of determining as large a region as possible where a function exceeds a given threshold with high probability.
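
In set notation (symbols assumed for illustration), the goal is to report a region $\hat{S}$ that is as large as possible while being contained, with probability at least $1-\delta$, in the true super-level set of the unknown function $f$ at threshold $\tau$:

\[
\hat{S} \subseteq \{\, x : f(x) \ge \tau \,\} \qquad \text{with probability at least } 1-\delta,
\]

with the Gaussian process posterior supplying the confidence bounds used to certify which points can safely be included.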

Gaussian Processes
