Search Results for author: Csaba Szepesvari

Found 70 papers, 4 papers with code

A simpler approach to accelerated optimization: iterative averaging meets optimism

no code implementations ICML 2020 Pooria Joulani, Anant Raj, András György, Csaba Szepesvari

In this paper, we show that there is a simpler approach to obtaining accelerated rates: applying generic, well-known optimistic online learning algorithms and using the online average of their predictions to query the (deterministic or stochastic) first-order optimization oracle at each time step.

Stochastic Gradient Succeeds for Bandits

no code implementations27 Feb 2024 Jincheng Mei, Zixin Zhong, Bo Dai, Alekh Agarwal, Csaba Szepesvari, Dale Schuurmans

We show that the \emph{stochastic gradient} bandit algorithm converges to a \emph{globally optimal} policy at an $O(1/t)$ rate, even with a \emph{constant} step size.
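The update analyzed here is plain REINFORCE on a softmax-parameterized bandit with a fixed step size. A minimal sketch for Bernoulli arms (the arm means, step size, and horizon are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.2, 0.9])   # hypothetical true mean rewards
theta = np.zeros(3)              # softmax logits
eta = 0.2                        # constant step size

def softmax(z):
    p = np.exp(z - z.max())      # subtract max for numerical stability
    return p / p.sum()

for t in range(20000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)               # sample an arm from the policy
    r = float(rng.binomial(1, mu[a]))     # Bernoulli reward
    grad = -r * pi                        # r * (e_a - pi): gradient of r * log pi(a)
    grad[a] += r
    theta += eta * grad                   # stochastic gradient ascent, no baseline
```

Despite the constant step size and the absence of a baseline, the policy mass concentrates on the best arm; the $O(1/t)$ result quantifies how fast.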

The Role of Baselines in Policy Gradient Optimization

no code implementations16 Jan 2023 Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, Dale Schuurmans

Instead, the analysis reveals that the primary effect of the value baseline is to \textbf{reduce the aggressiveness of the updates} rather than their variance.

Towards Painless Policy Optimization for Constrained MDPs

1 code implementation11 Apr 2022 Arushi Jain, Sharan Vaswani, Reza Babanezhad, Csaba Szepesvari, Doina Precup

We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of their primal and dual regret on online linear optimization problems.
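As a toy instance of such a primal-dual scheme, consider maximizing reward over a two-action policy subject to one linear constraint: the primal player runs Hedge (a no-regret online linear optimization algorithm) on the Lagrangian rewards, the dual player runs projected gradient descent on the multiplier, and the averaged policy trades off reward and constraint. All numbers below are illustrative, and this is a generic sketch of the primal-dual idea, not the paper's specific algorithm:

```python
import numpy as np

r = np.array([1.0, 0.0])   # per-action reward (illustrative)
c = np.array([0.0, 1.0])   # per-action constraint value (illustrative)
b = 0.5                    # constraint: need c . pi >= b

theta = np.zeros(2)        # Hedge weights for the primal (policy) player
lam = 0.0                  # Lagrange multiplier (dual variable)
eta_p, eta_d = 0.05, 0.05
T = 5000
avg_pi = np.zeros(2)

for t in range(T):
    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    avg_pi += pi / T
    # primal: Hedge step on the Lagrangian reward r + lam * c
    theta += eta_p * (r + lam * c)
    # dual: projected gradient step, raising lam when the constraint is violated
    lam = max(0.0, lam - eta_d * (c @ pi - b))
```

Averaging the iterates is what converts the two players' online regrets into bounds on reward sub-optimality and constraint violation, which is the reduction the abstract describes.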

Understanding the Effect of Stochasticity in Policy Optimization

no code implementations NeurIPS 2021 Jincheng Mei, Bo Dai, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans

We study the effect of stochasticity in on-policy policy optimization, and make the following four contributions.

The Curse of Passive Data Collection in Batch Reinforcement Learning

no code implementations18 Jun 2021 Chenjun Xiao, Ilbin Lee, Bo Dai, Dale Schuurmans, Csaba Szepesvari

In high-stakes applications, active experimentation may be considered too risky, and thus data are often collected passively.

Reinforcement Learning (RL)

On Multi-objective Policy Optimization as a Tool for Reinforcement Learning: Case Studies in Offline RL and Finetuning

no code implementations15 Jun 2021 Abbas Abdolmaleki, Sandy H. Huang, Giulia Vezzani, Bobak Shahriari, Jost Tobias Springenberg, Shruti Mishra, Dhruva TB, Arunkumar Byravan, Konstantinos Bousmalis, Andras Gyorgy, Csaba Szepesvari, Raia Hadsell, Nicolas Heess, Martin Riedmiller

Many advances that have improved the robustness and efficiency of deep reinforcement learning (RL) algorithms can, in one way or another, be understood as introducing additional objectives or constraints in the policy optimization step.

Offline RL Reinforcement Learning (RL) +1

Leveraging Non-uniformity in First-order Non-convex Optimization

no code implementations13 May 2021 Jincheng Mei, Yue Gao, Bo Dai, Csaba Szepesvari, Dale Schuurmans

Classical global convergence results for first-order methods rely on uniform smoothness and the \L{}ojasiewicz inequality.

BIG-bench Machine Learning

On the Optimality of Batch Policy Optimization Algorithms

no code implementations6 Apr 2021 Chenjun Xiao, Yifan Wu, Tor Lattimore, Bo Dai, Jincheng Mei, Lihong Li, Csaba Szepesvari, Dale Schuurmans

First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis.

Value prediction

Improved Regret Bound and Experience Replay in Regularized Policy Iteration

no code implementations25 Feb 2021 Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, Csaba Szepesvari

We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation.

On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method

no code implementations NeurIPS 2021 Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, Mengdi Wang

By assuming overparameterization of the policy and exploiting the hidden convexity of the problem, we further show that TSIVR-PG converges to a globally $\epsilon$-optimal policy with $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples.

Reinforcement Learning (RL)

Optimistic Policy Optimization with General Function Approximations

no code implementations1 Jan 2021 Qi Cai, Zhuoran Yang, Csaba Szepesvari, Zhaoran Wang

Although policy optimization with neural networks has a track record of achieving state-of-the-art results in reinforcement learning on various domains, the theoretical understanding of the computational and sample efficiency of policy optimization remains restricted to linear function approximations with finite-dimensional feature representations, which hinders the design of principled, effective, and efficient algorithms.

Reinforcement Learning (RL)

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

no code implementations15 Dec 2020 Dongruo Zhou, Quanquan Gu, Csaba Szepesvari

Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting.

Reinforcement Learning (RL)

Escaping the Gravitational Pull of Softmax

no code implementations NeurIPS 2020 Jincheng Mei, Chenjun Xiao, Bo Dai, Lihong Li, Csaba Szepesvari, Dale Schuurmans

Both findings are based on an analysis of convergence rates using the Non-uniform \L{}ojasiewicz (N\L{}) inequalities.

Differentiable Meta-Learning of Bandit Policies

no code implementations NeurIPS 2020 Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, Manzil Zaheer

Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution P. In this work, we learn such policies for an unknown distribution P using samples from P. Our approach is a form of meta-learning and exploits properties of P without making strong assumptions about its form.


Variational Policy Gradient Method for Reinforcement Learning with General Utilities

no code implementations NeurIPS 2020 Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, Mengdi Wang

Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function.

Reinforcement Learning (RL) +1

PAC-Bayes Analysis Beyond the Usual Bounds

no code implementations NeurIPS 2020 Omar Rivasplata, Ilja Kuzborskij, Csaba Szepesvari, John Shawe-Taylor

Specifically, we present a basic PAC-Bayes inequality for stochastic kernels, from which one may derive extensions of various known PAC-Bayes bounds as well as novel bounds.


Meta-Learning Bandit Policies by Gradient Ascent

no code implementations9 Jun 2020 Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, Craig Boutilier

Most bandit policies are designed either to minimize regret in any problem instance, making very few assumptions about the underlying environment, or to minimize regret in a Bayesian sense, assuming a prior distribution over environment parameters.

Meta-Learning Multi-Armed Bandits

Model-Based Reinforcement Learning with Value-Targeted Regression

no code implementations ICML 2020 Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, Lin F. Yang

We propose a model-based RL algorithm based on the optimism principle: in each episode, the set of models that are `consistent' with the collected data is constructed.

Model-based Reinforcement Learning Regression +2

On the Global Convergence Rates of Softmax Policy Gradient Methods

no code implementations ICML 2020 Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans

First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization.

Policy Gradient Methods
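With access to the true gradient the claimed behaviour is easy to reproduce numerically. A sketch of exact gradient ascent on the value of a softmax policy over three arms (rewards, step size, and horizon are illustrative):

```python
import numpy as np

r = np.array([0.2, 0.5, 0.9])    # illustrative deterministic arm rewards
theta = np.zeros(3)              # softmax parameters, uniform initial policy
eta = 0.5

def softmax(z):
    p = np.exp(z - z.max())
    return p / p.sum()

for t in range(2000):
    pi = softmax(theta)
    # exact policy gradient: dV/dtheta_a = pi_a * (r_a - pi . r)
    theta += eta * pi * (r - pi @ r)

pi = softmax(theta)              # pi now concentrates on the best arm
```

The deterministic iterates converge to the optimal arm, with the sub-optimality $1 - \pi_t(a^*)$ shrinking at the $O(1/t)$ rate stated in the abstract.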

Model Selection in Contextual Stochastic Bandit Problems

no code implementations NeurIPS 2020 Aldo Pacchiano, My Phan, Yasin Abbasi-Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, Csaba Szepesvari

Our methods rely on a novel and generic smoothing transformation for bandit algorithms that permits us to obtain optimal $O(\sqrt{T})$ model selection guarantees for stochastic contextual bandit problems as long as the optimal base algorithm satisfies a high probability regret guarantee.

Model Selection Multi-Armed Bandits

Differentiable Bandit Exploration

no code implementations NeurIPS 2020 Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, Manzil Zaheer

In this work, we learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.


Adaptive Approximate Policy Iteration

1 code implementation8 Feb 2020 Botao Hao, Nevena Lazic, Yasin Abbasi-Yadkori, Pooria Joulani, Csaba Szepesvari

This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.

Think out of the "Box": Generically-Constrained Asynchronous Composite Optimization and Hedging

no code implementations NeurIPS 2019 Pooria Joulani, András György, Csaba Szepesvari

ASYNCADA is, to our knowledge, the first asynchronous stochastic optimization algorithm with finite-time data-dependent convergence guarantees for generic convex constraints.

Stochastic Optimization

Learning with Good Feature Representations in Bandits and in RL with a Generative Model

no code implementations ICML 2020 Tor Lattimore, Csaba Szepesvari, Gellert Weisz

The construction by Du et al. (2019) implies that even if a learner is given linear features in $\mathbb R^d$ that approximate the rewards in a bandit with a uniform error of $\epsilon$, then searching for an action that is optimal up to $O(\epsilon)$ requires examining essentially all actions.

Autonomous exploration for navigating in non-stationary CMPs

no code implementations18 Oct 2019 Pratik Gajane, Ronald Ortner, Peter Auer, Csaba Szepesvari

We consider a setting in which the objective is to learn to navigate in a controlled Markov process (CMP) where transition probabilities may abruptly change.


Adaptive Exploration in Linear Contextual Bandit

no code implementations15 Oct 2019 Botao Hao, Tor Lattimore, Csaba Szepesvari

Contextual bandits serve as a fundamental model for many sequential decision making tasks.

Decision Making Multi-Armed Bandits

Exploration-Enhanced POLITEX

no code implementations27 Aug 2019 Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, Gellert Weisz

POLITEX has sublinear regret guarantees in uniformly-mixing MDPs when the value estimation error can be controlled, which can be satisfied if all policies sufficiently explore the environment.

PAC-Bayes with Backprop

no code implementations19 Aug 2019 Omar Rivasplata, Vikram M Tankasali, Csaba Szepesvari

We explore the family of methods "PAC-Bayes with Backprop" (PBB) to train probabilistic neural networks by minimizing PAC-Bayes bounds.

Exploration by Optimisation in Partial Monitoring

no code implementations12 Jul 2019 Tor Lattimore, Csaba Szepesvari

We provide a simple and efficient algorithm for adversarial $k$-action $d$-outcome non-degenerate locally observable partial monitoring games, for which the $n$-round minimax regret is bounded by $6(d+1) k^{3/2} \sqrt{n \log(k)}$, matching the best known information-theoretic upper bound.

Randomized Exploration in Generalized Linear Bandits

no code implementations21 Jun 2019 Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, Craig Boutilier

The first, GLM-TSL, samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution.
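The per-round computation of a Laplace-approximation sampler can be sketched for a logistic (Bernoulli) GLM: fit the MAP estimate by Newton's method, form the Hessian of the regularized log-likelihood at it, and sample a parameter from the resulting Gaussian. The arm features, true parameter, prior precision, and Newton-step count below are assumptions for illustration, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 0.0], [0.0, 1.0]])  # arm feature vectors (illustrative)
theta_star = np.array([2.0, -2.0])      # hypothetical true parameter
lam = 1.0                               # prior precision (L2 regularizer)
xs, ys = [], []

def laplace_sample(xs, ys):
    A = np.array(xs); y = np.array(ys)
    theta = np.zeros(2)
    for _ in range(10):                 # Newton's method for the MAP estimate
        p = sigmoid(A @ theta)
        g = A.T @ (p - y) + lam * theta
        H = A.T @ (A * (p * (1 - p))[:, None]) + lam * np.eye(2)
        theta -= np.linalg.solve(H, g)
    p = sigmoid(A @ theta)
    H = A.T @ (A * (p * (1 - p))[:, None]) + lam * np.eye(2)
    return rng.multivariate_normal(theta, np.linalg.inv(H))  # Laplace posterior

counts = [0, 0]
for t in range(300):
    if not xs:                          # no data yet: sample from the prior
        theta_t = rng.multivariate_normal(np.zeros(2), np.eye(2) / lam)
    else:
        theta_t = laplace_sample(xs, ys)
    a = int(np.argmax(X @ theta_t))     # Thompson-style arm choice
    counts[a] += 1
    xs.append(X[a])
    ys.append(float(rng.binomial(1, sigmoid(X[a] @ theta_star))))
```

The randomness of the sampled parameter is what drives exploration; as data accumulate, the Laplace posterior tightens and the good arm is pulled increasingly often.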

Empirical Bayes Regret Minimization

no code implementations4 Apr 2019 Chih-Wei Hsu, Branislav Kveton, Ofer Meshi, Martin Mladenov, Csaba Szepesvari

In this work, we pioneer the idea of algorithm design by minimizing the empirical Bayes regret, the average regret over problem instances sampled from a known distribution.

An Exponential Efron-Stein Inequality for Lq Stable Learning Rules

no code implementations12 Mar 2019 Karim Abou-Moustafa, Csaba Szepesvari

In particular, the main question we address here is \emph{whether it is possible to derive exponential generalization bounds for the estimated risk using a notion of stability that is computationally tractable and distribution dependent, but weaker than uniform stability}.

Generalization Bounds

Perturbed-History Exploration in Stochastic Multi-Armed Bandits

no code implementations26 Feb 2019 Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier

Finally, we empirically evaluate PHE and show that it is competitive with state-of-the-art baselines.

Multi-Armed Bandits

An Information-Theoretic Approach to Minimax Regret in Partial Monitoring

no code implementations1 Feb 2019 Tor Lattimore, Csaba Szepesvari

We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under partial monitoring with no assumptions on the space of signals or decisions of the adversary.

Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits

no code implementations13 Nov 2018 Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Mohammad Ghavamzadeh, Tor Lattimore

Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards.

Multi-Armed Bandits
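The mechanism in that sentence, bootstrap the arm's history augmented with pseudo-rewards and then act greedily on the bootstrap means, fits in a few lines. A sketch for Bernoulli arms, assuming one pseudo-reward of 0 and one of 1 is added per real observation (arm means, seeding, and horizon are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = [0.2, 0.8]                       # illustrative arm means
history = [[0.0, 1.0], [0.0, 1.0]]    # each arm seeded with pseudo-rewards
counts = [0, 0]

for t in range(2000):
    means = []
    for h in history:
        boot = rng.choice(h, size=len(h), replace=True)  # bootstrap resample
        means.append(boot.mean())
    a = int(np.argmax(means))          # greedy w.r.t. bootstrap means
    counts[a] += 1
    reward = float(rng.binomial(1, mu[a]))
    history[a].extend([reward, 0.0, 1.0])  # real reward + pseudo-rewards 0 and 1
```

The pseudo-rewards keep the bootstrap distribution from collapsing, so undersampled arms retain a chance of looking best, which is what replaces explicit confidence bonuses here.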

LeapsAndBounds: A Method for Approximately Optimal Algorithm Configuration

no code implementations ICML 2018 Gellert Weisz, Andras Gyorgy, Csaba Szepesvari

We consider the problem of configuring general-purpose solvers to run efficiently on problem instances drawn from an unknown distribution.

BubbleRank: Safe Online Learning to Re-Rank via Implicit Click Feedback

no code implementations15 Jun 2018 Chang Li, Branislav Kveton, Tor Lattimore, Ilya Markov, Maarten de Rijke, Csaba Szepesvari, Masrour Zoghi

In this paper, we study the problem of safe online learning to re-rank, where user feedback is used to improve the quality of displayed lists.

Learning-To-Rank Re-Ranking +1

TopRank: A practical algorithm for online stochastic ranking

no code implementations NeurIPS 2018 Tor Lattimore, Branislav Kveton, Shuai Li, Csaba Szepesvari

Online learning to rank is a sequential decision-making problem where in each round the learning agent chooses a list of items and receives feedback in the form of clicks from the user.

Decision Making Learning-To-Rank +1

Cleaning up the neighborhood: A full classification for adversarial partial monitoring

no code implementations23 May 2018 Tor Lattimore, Csaba Szepesvari

Partial monitoring is a generalization of the well-known multi-armed bandit framework where the loss is not directly observed by the learner.

General Classification

Model-Free Linear Quadratic Control via Reduction to Expert Prediction

no code implementations17 Apr 2018 Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari

Model-free approaches for reinforcement learning (RL) and continuous control find policies based only on past states and rewards, without fitting a model of the system dynamics.

Continuous Control Reinforcement Learning (RL)

Multi-view Matrix Factorization for Linear Dynamical System Estimation

no code implementations NeurIPS 2017 Mahdi Karami, Martha White, Dale Schuurmans, Csaba Szepesvari

In this paper, we instead reconsider likelihood maximization and develop an optimization based strategy for recovering the latent states and transition parameters.

Bandits with Delayed, Aggregated Anonymous Feedback

no code implementations ICML 2018 Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvari, Steffen Grunewalder

In this problem, when the player pulls an arm, a reward is generated; however, it is not immediately observed.

Crowdsourcing with Sparsely Interacting Workers

no code implementations20 Jun 2017 Yao Ma, Alex Olshevsky, Venkatesh Saligrama, Csaba Szepesvari

We then formulate a weighted rank-one optimization problem to estimate skills based on observations on an irreducible, aperiodic interaction graph.

Binary Classification Matrix Completion

An a Priori Exponential Tail Bound for k-Folds Cross-Validation

no code implementations19 Jun 2017 Karim Abou-Moustafa, Csaba Szepesvari

Next, under some reasonable notion of stability, we use this exponential tail bound to analyze the concentration of the k-fold cross-validation (KFCV) estimate around the true risk of a hypothesis generated by a general learning rule.

Generalization Bounds

Sequential Learning without Feedback

no code implementations18 Oct 2016 Manjesh Hanawal, Csaba Szepesvari, Venkatesh Saligrama

We reduce USS to a special case of multi-armed bandit problem with side information and develop polynomial time algorithms that achieve sublinear regret.

The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits

no code implementations14 Oct 2016 Tor Lattimore, Csaba Szepesvari

Stochastic linear bandits are a natural and simple generalisation of finite-armed bandits with numerous practical applications.

Reinforcement Learning (RL) +1

Stochastic Rank-1 Bandits

no code implementations10 Aug 2016 Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, Zheng Wen

The main challenge of the problem is that the individual values of the row and column are unobserved.

Linear Multi-Resource Allocation with Semi-Bandit Feedback

no code implementations NeurIPS 2015 Tor Lattimore, Koby Crammer, Csaba Szepesvari

In each time step the learner chooses an allocation of several resource types between a number of tasks.

Learning with a Strong Adversary

1 code implementation10 Nov 2015 Ruitong Huang, Bing Xu, Dale Schuurmans, Csaba Szepesvari

The robustness of neural networks to intended perturbations has recently attracted significant attention.

General Classification

Combinatorial Cascading Bandits

no code implementations NeurIPS 2015 Branislav Kveton, Zheng Wen, Azin Ashkan, Csaba Szepesvari

The agent observes the index of the first chosen item whose weight is zero.
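The feedback model in that sentence is easy to pin down in code: the agent sees only the position of the first zero-weight item in its chosen list, and nothing if every weight is one. A small sketch (the weight lists are illustrative):

```python
def cascade_feedback(weights):
    """Return the index of the first chosen item whose weight is zero,
    or None if every chosen item has weight one."""
    for i, w in enumerate(weights):
        if w == 0:
            return i
    return None
```

For example, `cascade_feedback([1, 1, 0, 1])` returns 2, while `cascade_feedback([1, 1])` returns `None`; the items after the first zero are never revealed, which is the source of the partial feedback.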

Cascading Bandits: Learning to Rank in the Cascade Model

no code implementations10 Feb 2015 Branislav Kveton, Csaba Szepesvari, Zheng Wen, Azin Ashkan

We also prove gap-dependent upper bounds on the regret of these algorithms and derive a lower bound on the regret in cascading bandits.


Universal Option Models

no code implementations NeurIPS 2014 Hengshuai Yao, Csaba Szepesvari, Richard S. Sutton, Joseph Modayil, Shalabh Bhatnagar

We prove that the UOM of an option can construct a traditional option model given a reward function, and the option-conditional return is computed directly by a single dot-product of the UOM with the reward function.

Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits

no code implementations3 Oct 2014 Branislav Kveton, Zheng Wen, Azin Ashkan, Csaba Szepesvari

A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff.
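The interaction protocol described here can be sketched with a per-item UCB index. Below, the "combinatorial constraint" is simply "choose any $m$ of $K$ items" and the weights are Bernoulli; all numbers are illustrative, and this is a generic CombUCB-flavored sketch, not necessarily the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.3, 0.5, 0.9])   # illustrative item means
K, m = 4, 2                           # choose m of K items each round
pulls = np.ones(K)                    # one initialization pull per item
sums = rng.binomial(1, mu).astype(float)
chosen_counts = np.zeros(K)
total_reward = 0.0

for t in range(2, 3000):
    ucb = sums / pulls + np.sqrt(1.5 * np.log(t) / pulls)  # per-item UCB index
    A = np.argsort(-ucb)[:m]          # best feasible subset: top-m indices
    w = rng.binomial(1, mu[A])        # semi-bandit: observe each chosen weight
    total_reward += w.sum()           # payoff is the sum of observed weights
    sums[A] += w
    pulls[A] += 1
    chosen_counts[A] += 1
```

Because every chosen item's weight is observed (semi-bandit feedback), each item's index tightens at the rate of its own pull count, which is what makes the tight regret bounds possible.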

On Minimax Optimal Offline Policy Evaluation

no code implementations12 Sep 2014 Lihong Li, Remi Munos, Csaba Szepesvari

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy.

Multi-Armed Bandits Off-policy evaluation

Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods

no code implementations20 Jun 2012 Gergely Neu, Csaba Szepesvari

In this paper we propose a novel gradient algorithm to learn a policy from an expert's observed behavior assuming that the expert behaves optimally with respect to some unknown reward function of a Markovian Decision Problem.

Reinforcement Learning (RL)

Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping

no code implementations13 Jun 2012 Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, Michael P. Bowling

Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions.

PAC-Bayesian Policy Evaluation for Reinforcement Learning

no code implementations14 Feb 2012 Mahdi Milani Fard, Joelle Pineau, Csaba Szepesvari

PAC-Bayesian methods overcome this problem by providing bounds that hold regardless of the correctness of the prior distribution.

Model Selection Reinforcement Learning (RL) +2
