Search Results for author: Haipeng Luo

Found 49 papers, 3 papers with code

Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

no code implementations ICML 2020 Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses.

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

no code implementations18 Jul 2021 Haipeng Luo, Chen-Yu Wei, Chung-Wei Lee

When a simulator is unavailable, we further consider a linear MDP setting and obtain $\widetilde{\mathcal{O}}(T^{14/15})$ regret, which is the first result for linear MDPs with adversarial losses and bandit feedback.

Last-iterate Convergence in Extensive-Form Games

no code implementations27 Jun 2021 Chung-Wei Lee, Christian Kroer, Haipeng Luo

Inspired by recent advances on last-iterate convergence of optimistic algorithms in zero-sum normal-form games, we study this phenomenon in sequential games, and provide a comprehensive study of last-iterate convergence for zero-sum extensive-form games with perfect recall (EFGs), using various optimistic regret-minimization algorithms over treeplexes.

Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path

no code implementations15 Jun 2021 Liyu Chen, Mehdi Jafarnia-Jahromi, Rahul Jain, Haipeng Luo

We introduce a generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model, which achieves minimax optimal regret as long as certain properties are ensured.

Online Learning for Stochastic Shortest Path Model via Posterior Sampling

no code implementations9 Jun 2021 Mehdi Jafarnia-Jahromi, Liyu Chen, Rahul Jain, Haipeng Luo

We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state.
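
The posterior-sampling principle that the paper extends to SSP models is easiest to see in the Bernoulli bandit special case. Below is a minimal sketch of that idea (not the paper's SSP algorithm; all parameters are illustrative): sample a model from the posterior, act optimally under the sample, update the posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms, horizon = 5, 10_000
true_means = rng.uniform(size=n_arms)      # unknown to the learner
alpha = np.ones(n_arms)                    # Beta(1, 1) priors: success counts
beta = np.ones(n_arms)                     # failure counts
pulls = np.zeros(n_arms, dtype=int)

for t in range(horizon):
    theta = rng.beta(alpha, beta)          # one posterior sample per arm
    arm = int(np.argmax(theta))            # act greedily w.r.t. the sampled model
    reward = float(rng.random() < true_means[arm])
    alpha[arm] += reward                   # conjugate posterior update
    beta[arm] += 1.0 - reward
    pulls[arm] += 1

print("best arm:", int(np.argmax(true_means)), "pull counts:", pulls)
```

In the paper's setting the sampled model is, roughly speaking, a full MDP with an absorbing state, and acting optimally means solving for the sampled model's optimal policy.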

The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition

no code implementations8 Jun 2021 Tiancheng Jin, Longbo Huang, Haipeng Luo

We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes, with the goal of achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the losses are adversarial and simultaneously $\mathcal{O}(\text{polylog}(T))$ regret when the losses are (almost) stochastic.

Finding the Stochastic Shortest Path with Low Regret: The Adversarial Cost and Unknown Transition Case

no code implementations10 Feb 2021 Liyu Chen, Haipeng Luo

Our work strictly improves (Rosenberg and Mansour, 2020) in the full information setting, extends (Chen et al., 2020) from known transition to unknown transition, and is also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.

Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach

no code implementations10 Feb 2021 Chen-Yu Wei, Haipeng Luo

Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{1/3}T^{2/3}\})$ where $T$ is the number of rounds and $L$ and $\Delta$ are the number and amount of changes of the world respectively, while previous works only obtain suboptimal bounds and/or require the knowledge of $L$ and $\Delta$.

Multi-Armed Bandits

Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games

no code implementations8 Feb 2021 Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo

We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play.

Impossible Tuning Made Possible: A New Expert Algorithm and Its Applications

no code implementations1 Feb 2021 Liyu Chen, Haipeng Luo, Chen-Yu Wei

We resolve the long-standing "impossible tuning" issue for the classic expert problem and show that it is in fact possible to achieve regret $O\left(\sqrt{(\ln d)\sum_t \ell_{t, i}^2}\right)$ simultaneously for every expert $i$ in a $T$-round $d$-expert problem, where $\ell_{t, i}$ is the loss of expert $i$ in round $t$.
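
For context, here is the standard comparison behind the name (added for orientation, not taken from the paper): the classic exponential-weights bound with one tuned learning rate, versus the per-expert bound achieved here.

```latex
% Hedge with a single learning rate tuned to the horizon:
R_T(i) \;=\; \sum_{t=1}^{T}\langle p_t, \ell_t\rangle \;-\; \sum_{t=1}^{T}\ell_{t,i}
\;\le\; O\!\left(\sqrt{T\ln d}\right) \qquad \text{for all } i.

% "Impossible tuning": the bound a learning rate tuned to expert i would give,
% achieved for every i simultaneously without knowing i in advance:
R_T(i) \;\le\; O\!\left(\sqrt{(\ln d)\sum_{t=1}^{T}\ell_{t,i}^{2}}\right).
```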

Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

no code implementations7 Dec 2020 Liyu Chen, Haipeng Luo, Chen-Yu Wei

We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$ for the full-information setting and the bandit feedback setting respectively, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes.

Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

no code implementations23 Jul 2020 Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain

We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation.

Comparator-adaptive Convex Bandits

no code implementations NeurIPS 2020 Dirk van der Hoeven, Ashok Cutkosky, Haipeng Luo

We study bandit convex optimization methods that adapt to the norm of the comparator, a topic that has only been studied before for its full-information counterpart.

Active Online Learning with Hidden Shifting Domains

no code implementations25 Jun 2020 Yining Chen, Haipeng Luo, Tengyu Ma, Chicheng Zhang

We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains.

Domain Adaptation

Open Problem: Model Selection for Contextual Bandits

no code implementations19 Jun 2020 Dylan J. Foster, Akshay Krishnamurthy, Haipeng Luo

In statistical learning, algorithms for model selection allow the learner to adapt to the complexity of the best hypothesis class in a sequence.

Model Selection, Multi-Armed Bandits

Linear Last-iterate Convergence in Constrained Saddle-point Optimization

1 code implementation ICLR 2021 Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo

Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achieved with a learning rate whose value is set to a universal constant, improving the result of (Daskalakis & Panageas, 2019b) under the same assumption.
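
A minimal self-play sketch of OMWU on a random bilinear game over the simplex (a sketch under assumed parameters; the game, step size, and iteration count are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))            # bilinear game: min_x max_y x^T A y
eta = 0.05                                 # constant learning rate

def omwu_step(p, g, g_prev, sign):
    # Optimistic MWU: replace the gradient with the prediction 2*g_t - g_{t-1}.
    logits = np.log(p) - sign * eta * (2.0 * g - g_prev)
    w = np.exp(logits - logits.max())
    return w / w.sum()

x = np.ones(4) / 4                         # min player's strategy
y = np.ones(4) / 4                         # max player's strategy
gx_prev, gy_prev = A @ y, A.T @ x

for t in range(20_000):
    gx, gy = A @ y, A.T @ x                # gradients at the current iterates
    x = omwu_step(x, gx, gx_prev, sign=+1) # min player descends
    y = omwu_step(y, gy, gy_prev, sign=-1) # max player ascends
    gx_prev, gy_prev = gx, gy

# Duality gap of the *last* iterate; it should shrink toward 0 (a Nash equilibrium).
print("last-iterate duality gap:", (A.T @ x).max() - (A @ y).min())
```

When the equilibrium is unique, the paper shows this gap shrinks at a linear rate even with a universal constant learning rate as above.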

Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs

no code implementations NeurIPS 2020 Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang

We develop a new approach to obtaining high probability regret bounds for online learning with bandit feedback against an adaptive adversary.

Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

no code implementations NeurIPS 2020 Tiancheng Jin, Haipeng Luo

This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback.

Multi-Armed Bandits

A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret

no code implementations8 Jun 2020 Mehdi Jafarnia-Jahromi, Chen-Yu Wei, Rahul Jain, Haipeng Luo

Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation.

Q-Learning

Adversarial Online Learning with Changing Action Sets: Efficient Algorithms with Approximate Regret Bounds

no code implementations7 Mar 2020 Ehsan Emamjomeh-Zadeh, Chen-Yu Wei, Haipeng Luo, David Kempe

We revisit the problem of online learning with sleeping experts/bandits: in each time step, only a subset of the actions are available for the algorithm to choose from (and learn about).

Taking a hint: How to leverage loss predictors in contextual bandits?

no code implementations4 Mar 2020 Chen-Yu Wei, Haipeng Luo, Alekh Agarwal

We initiate the study of learning in contextual bandits with the help of loss predictors.

Multi-Armed Bandits

A Closer Look at Small-loss Bounds for Bandits with Graph Feedback

no code implementations2 Feb 2020 Chung-Wei Lee, Haipeng Luo, Mengxiao Zhang

We study small-loss bounds for adversarial multi-armed bandits with graph feedback, that is, adaptive regret bounds that depend on the loss of the best arm or related quantities, instead of the total number of rounds.

Multi-Armed Bandits
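
For reference, the standard importance-weighted estimator for graph feedback (the Exp3-SET baseline of Alon et al. that small-loss bounds improve upon) can be sketched as follows; the random graph, losses, and step size are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
K, T, eta = 6, 20_000, 0.05
G = rng.random((K, K)) < 0.4               # G[i, j]: playing arm i reveals arm j's loss
np.fill_diagonal(G, True)                  # the played arm is always observed
losses = rng.random((T, K)) * np.linspace(0.5, 1.0, K)   # arm 0 is best on average

L_hat, total = np.zeros(K), 0.0
for t in range(T):
    w = np.exp(-eta * (L_hat - L_hat.min()))
    p = w / w.sum()
    arm = int(rng.choice(K, p=p))
    total += losses[t, arm]
    observed = G[arm]                      # side observations given by the graph
    q = p @ G.astype(float)                # q[j] = P(arm j's loss is observed)
    L_hat += np.where(observed, losses[t] / q, 0.0)   # unbiased loss estimates

print("regret vs best arm:", total - losses.sum(axis=0).min())
```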

Fair Contextual Multi-Armed Bandits: Theory and Experiments

no code implementations13 Dec 2019 Yifang Chen, Alex Cuellar, Haipeng Luo, Jignesh Modi, Heramb Nemlekar, Stefanos Nikolaidis

We introduce a Multi-Armed Bandit algorithm with fairness constraints, where fairness is defined as a minimum rate that a task or a resource is assigned to a user.

Decision Making, Fairness, +1

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

no code implementations3 Dec 2019 Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses.

Model selection for contextual bandits

1 code implementation NeurIPS 2019 Dylan J. Foster, Akshay Krishnamurthy, Haipeng Luo

We work in the stochastic realizable setting with a sequence of nested linear policy classes of dimension $d_1 < d_2 < \ldots$, where the $m^\star$-th class contains the optimal policy, and we design an algorithm that achieves $\tilde{O}(T^{2/3}d^{1/3}_{m^\star})$ regret with no prior knowledge of the optimal dimension $d_{m^\star}$.

Model Selection, Multi-Armed Bandits

Equipping Experts/Bandits with Long-term Memory

no code implementations NeurIPS 2019 Kai Zheng, Haipeng Luo, Ilias Diakonikolas, Li-Wei Wang

We propose the first reduction-based approach to obtaining long-term memory guarantees for online learning in the sense of Bousquet and Warmuth (2002), by reducing the problem to achieving typical switching regret.

Multi-Armed Bandits

Hypothesis Set Stability and Generalization

no code implementations NeurIPS 2019 Dylan J. Foster, Spencer Greenberg, Satyen Kale, Haipeng Luo, Mehryar Mohri, Karthik Sridharan

Our main result is a generalization bound for data-dependent hypothesis sets expressed in terms of a notion of hypothesis set stability and a notion of Rademacher complexity for data-dependent hypothesis sets that we introduce.

A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free

no code implementations3 Feb 2019 Yifang Chen, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei

We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret.

Multi-Armed Bandits

Improved Path-length Regret Bounds for Bandits

no code implementations29 Jan 2019 Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, Chen-Yu Wei

We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandit and more generally linear bandit.

Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously

no code implementations25 Jan 2019 Julian Zimmert, Haipeng Luo, Chen-Yu Wei

We develop the first general semi-bandit algorithm that simultaneously achieves $\mathcal{O}(\log T)$ regret for stochastic environments and $\mathcal{O}(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$.
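
The best-of-both-worlds phenomenon for plain multi-armed bandits is achieved by Tsallis-INF (Zimmert and Seldin), which this paper generalizes to semi-bandits; below is a minimal sketch of that multi-armed special case (the learning-rate schedule and bisection tolerance are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
K, T = 8, 50_000
means = rng.uniform(0.2, 0.8, size=K)      # stochastic regime for the demo

def tsallis_probs(L_hat, eta):
    # FTRL with 1/2-Tsallis entropy: p_i = 4 / (eta * (L_i - x))^2, where the
    # normalizer x < min_i L_i is found by bisection so that the p_i sum to 1.
    n = len(L_hat)
    lo = L_hat.min() - 2.0 * np.sqrt(n) / eta   # sum of p_i <= 1 here
    hi = L_hat.min() - 2.0 / eta                # sum of p_i >= 1 here
    for _ in range(60):
        x = 0.5 * (lo + hi)
        if np.sum(4.0 / (eta * (L_hat - x)) ** 2) < 1.0:
            lo = x
        else:
            hi = x
    p = 4.0 / (eta * (L_hat - 0.5 * (lo + hi))) ** 2
    return p / p.sum()

L_hat, regret = np.zeros(K), 0.0
for t in range(1, T + 1):
    p = tsallis_probs(L_hat, eta=2.0 / np.sqrt(t))
    arm = int(rng.choice(K, p=p))
    loss = float(rng.random() < means[arm])
    L_hat[arm] += loss / p[arm]            # importance-weighted loss estimate
    regret += means[arm] - means.min()

print("pseudo-regret:", regret)
```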

Efficient Online Portfolio with Logarithmic Regret

no code implementations NeurIPS 2018 Haipeng Luo, Chen-Yu Wei, Kai Zheng

We study the decades-old problem of online portfolio management and propose the first algorithm with logarithmic regret that is not based on Cover's Universal Portfolio algorithm and admits much faster implementation.

Logistic Regression: The Importance of Being Improper

no code implementations25 Mar 2018 Dylan J. Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, Karthik Sridharan

Starting with the simple observation that the logistic loss is $1$-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm.

Practical Contextual Bandits with Regression Oracles

no code implementations ICML 2018 Dylan J. Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, Robert E. Schapire

A major challenge in contextual bandits is to design general-purpose algorithms that are both practically useful and theoretically well-founded.

General Classification, Multi-Armed Bandits

More Adaptive Algorithms for Adversarial Bandits

no code implementations10 Jan 2018 Chen-Yu Wei, Haipeng Luo

We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem).

Efficient Contextual Bandits in Non-stationary Worlds

no code implementations5 Aug 2017 Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, John Langford

In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests so as to dynamically adapt to a change in distribution.

Multi-Armed Bandits

Corralling a Band of Bandit Algorithms

no code implementations19 Dec 2016 Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, Robert E. Schapire

We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm if it were to be run on its own.

Multi-Armed Bandits
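
The plumbing of any such master is to sample a base algorithm and feed it importance-weighted feedback. The sketch below uses a plain exponential-weights master purely for brevity; the paper's Corral master instead runs log-barrier OMD with increasing learning rates, which is what lets the best base algorithm's guarantee carry over, so treat this only as an illustration of the interface:

```python
import numpy as np

rng = np.random.default_rng(4)

class Exp3Base:
    """A base bandit algorithm, treated as a black box by the master."""
    def __init__(self, n_arms, eta):
        self.L, self.eta, self.n_arms = np.zeros(n_arms), eta, n_arms
    def act(self):
        w = np.exp(-self.eta * (self.L - self.L.min()))
        self.p = w / w.sum()
        return int(rng.choice(self.n_arms, p=self.p))
    def update(self, arm, loss):
        self.L[arm] += loss / self.p[arm]  # the base does its own importance weighting

bases = [Exp3Base(5, eta=0.05), Exp3Base(5, eta=0.005)]
M, eta_master = len(bases), 0.02
L_master = np.zeros(M)
losses = rng.random((10_000, 5)) * np.linspace(0.2, 1.0, 5)

for t in range(losses.shape[0]):
    w = np.exp(-eta_master * (L_master - L_master.min()))
    q = w / w.sum()
    i = int(rng.choice(M, p=q))            # sample a base algorithm
    arm = bases[i].act()                   # play the action it proposes
    loss = losses[t, arm]
    bases[i].update(arm, loss / q[i])      # importance-weighted feedback to the base
    L_master[i] += loss / q[i]             # master's loss estimate for base i
```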

Oracle-Efficient Online Learning and Auction Design

no code implementations5 Nov 2016 Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, Jennifer Wortman Vaughan

We consider the design of computationally efficient online learning algorithms in an adversarial setting in which the learner has access to an offline optimization oracle.

Efficient Second Order Online Learning by Sketching

no code implementations NeurIPS 2016 Haipeng Luo, Alekh Agarwal, Nicolo Cesa-Bianchi, John Langford

We propose Sketched Online Newton (SON), an online second order learning algorithm that enjoys substantially improved regret guarantees for ill-conditioned data.
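
SON builds on the classic Online Newton Step; the unsketched ONS loop below shows what is being accelerated (the squared loss, parameters, and omitted generalized projection are simplifications; SON's contribution is replacing the d-by-d matrix with a low-rank sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, gamma = 10, 5_000, 1.0
w_star = rng.standard_normal(d)            # ground truth for a synthetic stream

w = np.zeros(d)
A = np.eye(d)                              # curvature matrix; this is what SON sketches
for t in range(T):
    x = rng.standard_normal(d)
    y = w_star @ x + 0.1 * rng.standard_normal()
    g = (w @ x - y) * x                    # gradient of the squared loss 0.5*(w@x - y)**2
    A += np.outer(g, g)                    # rank-one curvature update
    w -= (1.0 / gamma) * np.linalg.solve(A, g)   # Newton-style step
    # (the full algorithm also projects w onto the feasible set in the A-norm)

print("estimation error:", np.linalg.norm(w - w_star))
```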

Variance-Reduced and Projection-Free Stochastic Optimization

no code implementations5 Feb 2016 Elad Hazan, Haipeng Luo

The Frank-Wolfe optimization algorithm has recently regained popularity for machine learning applications due to its projection-free property and its ability to handle structured constraints.

Stochastic Optimization
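
For reference, one classic Frank-Wolfe step needs only a linear minimization oracle and no projection; here is a minimal sketch on the simplex with an illustrative quadratic objective (not the paper's variance-reduced stochastic variant):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 20
M = rng.standard_normal((d, d))
Q, b = M.T @ M, rng.standard_normal(d)     # f(x) = 0.5 x^T Q x - b^T x, convex

x = np.ones(d) / d                         # start inside the probability simplex
for t in range(1, 501):
    grad = Q @ x - b
    s = np.zeros(d)
    s[int(np.argmin(grad))] = 1.0          # linear oracle over the simplex: a vertex
    x += (2.0 / (t + 2.0)) * (s - x)       # classic step size; x stays feasible
print("objective:", 0.5 * x @ Q @ x - b @ x)
```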

Fast Convergence of Regularized Learning in Games

no code implementations NeurIPS 2015 Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, Robert E. Schapire

We show that natural classes of regularized learning algorithms with a form of recency bias achieve faster convergence rates to approximate efficiency and to coarse correlated equilibria in multiplayer normal form games.

Online Gradient Boosting

no code implementations NeurIPS 2015 Alina Beygelzimer, Elad Hazan, Satyen Kale, Haipeng Luo

We extend the theory of boosting for regression problems to the online learning setting.

Achieving All with No Parameters: Adaptive NormalHedge

no code implementations20 Feb 2015 Haipeng Luo, Robert E. Schapire

We study the classic online learning problem of predicting with expert advice, and propose a truly parameter-free and adaptive algorithm that achieves several objectives simultaneously without using any prior information.

Optimal and Adaptive Algorithms for Online Boosting

no code implementations9 Feb 2015 Alina Beygelzimer, Satyen Kale, Haipeng Luo

We study online boosting, the task of converting any weak online learner into a strong online learner.

Accelerated Parallel Optimization Methods for Large Scale Machine Learning

no code implementations25 Nov 2014 Haipeng Luo, Patrick Haffner, Jean-Francois Paiement

The growing amount of high dimensional data in different machine learning applications requires more efficient and scalable optimization algorithms.

A Drifting-Games Analysis for Online Learning and Applications to Boosting

no code implementations NeurIPS 2014 Haipeng Luo, Robert E. Schapire

Different online learning settings (Hedge, multi-armed bandit problems and online convex optimization) are studied by converting them into various kinds of drifting games.

Towards Minimax Online Learning with Unknown Time Horizon

no code implementations31 Jul 2013 Haipeng Luo, Robert E. Schapire

We apply a minimax analysis, beginning with the fixed horizon case, and then moving on to two unknown-horizon settings, one that assumes the horizon is chosen randomly according to some known distribution, and the other which allows the adversary full control over the horizon.
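
The standard baseline for unknown horizons, which this minimax analysis goes beyond, is the doubling trick: run a fixed-horizon algorithm with a guessed horizon and restart with the guess doubled when it expires. A minimal sketch with Hedge as the fixed-horizon algorithm (all parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d, T = 10, 100_000
losses = rng.random((T, d)) * np.linspace(0.5, 1.0, d)   # expert 0 is best on average

total, t, H = 0.0, 0, 1
while t < T:
    eta = np.sqrt(8.0 * np.log(d) / H)     # Hedge tuning for a known horizon H
    L = np.zeros(d)
    for _ in range(min(H, T - t)):
        w = np.exp(-eta * (L - L.min()))
        p = w / w.sum()
        total += p @ losses[t]             # expected loss of the randomized play
        L += losses[t]
        t += 1
    H *= 2                                 # guess expired: restart with double

print("regret:", total - losses.sum(axis=0).min())
```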
