Search Results for author: Haipeng Luo

Found 74 papers, 7 papers with code

Towards Minimax Online Learning with Unknown Time Horizon

no code implementations 31 Jul 2013 Haipeng Luo, Robert E. Schapire

We apply a minimax analysis, beginning with the fixed-horizon case and then moving on to two unknown-horizon settings: one assumes the horizon is chosen randomly according to some known distribution, while the other allows the adversary full control over the horizon.
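
For context, the standard remedy for an unknown horizon is the classic doubling trick: run a fixed-horizon algorithm in phases of length 1, 2, 4, ..., restarting it at each phase boundary. The Python sketch below illustrates only that baseline, not the paper's minimax algorithms; the learner interface (make_learner, .predict(), .update()) is a hypothetical one for illustration.

```python
def doubling_trick(make_learner, losses):
    """Run a fixed-horizon learner over phases of doubling length.

    Assumes make_learner(T) returns a fresh learner tuned for horizon T,
    exposing .predict() -> index and .update(loss_vector).
    """
    total_loss, t, phase = 0.0, 0, 0
    while t < len(losses):
        horizon = 2 ** phase              # this phase lasts 2^phase rounds
        learner = make_learner(horizon)   # restart, tuned for this length
        for _ in range(horizon):
            if t == len(losses):
                break
            i = learner.predict()
            total_loss += losses[t][i]
            learner.update(losses[t])
            t += 1
        phase += 1
    return total_loss
```

The doubling trick usually preserves regret rates only up to a multiplicative constant, which is precisely the kind of overhead a minimax analysis can quantify or avoid.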

A Drifting-Games Analysis for Online Learning and Applications to Boosting

no code implementations NeurIPS 2014 Haipeng Luo, Robert E. Schapire

Different online learning settings (Hedge, multi-armed bandit problems, and online convex optimization) are studied by converting them into various kinds of drifting games.

Accelerated Parallel Optimization Methods for Large Scale Machine Learning

no code implementations 25 Nov 2014 Haipeng Luo, Patrick Haffner, Jean-Francois Paiement

The growing amount of high dimensional data in different machine learning applications requires more efficient and scalable optimization algorithms.

BIG-bench Machine Learning

Optimal and Adaptive Algorithms for Online Boosting

no code implementations 9 Feb 2015 Alina Beygelzimer, Satyen Kale, Haipeng Luo

We study online boosting, the task of converting any weak online learner into a strong online learner.

Achieving All with No Parameters: Adaptive NormalHedge

no code implementations 20 Feb 2015 Haipeng Luo, Robert E. Schapire

We study the classic online learning problem of predicting with expert advice, and propose a truly parameter-free and adaptive algorithm that achieves several objectives simultaneously without using any prior information.
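
For background, a minimal sketch of the original NormalHedge update (Chaudhuri, Freund & Hsu, 2009) that this work builds on and makes adaptive: weights are driven by a potential of the positive part of each expert's cumulative regret, with the scale c chosen so the average potential equals e. This is the vanilla rule, not the paper's Adaptive NormalHedge.

```python
import numpy as np

def normalhedge_weights(regrets):
    """One round of vanilla NormalHedge. regrets[i] is the learner's
    cumulative regret to expert i so far; returns the next weight vector."""
    R = np.maximum(np.asarray(regrets, dtype=float), 0.0)
    if R.max() == 0.0:                 # no expert is ahead: play uniformly
        return np.ones(len(R)) / len(R)
    # bisect for the scale c > 0 solving mean_i exp(R_i^2 / (2c)) = e;
    # this bracket keeps every exponent <= 700, avoiding overflow
    lo, hi = R.max() ** 2 / 1400.0, R.max() ** 2 / 2.0 + 1.0
    for _ in range(100):
        c = (lo + hi) / 2.0
        if np.mean(np.exp(R ** 2 / (2.0 * c))) > np.e:
            lo = c                     # potential too large: increase scale
        else:
            hi = c
    w = (R / c) * np.exp(R ** 2 / (2.0 * c))
    return w / w.sum()
```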

Online Gradient Boosting

no code implementations NeurIPS 2015 Alina Beygelzimer, Elad Hazan, Satyen Kale, Haipeng Luo

We extend the theory of boosting for regression problems to the online learning setting.

regression

Fast Convergence of Regularized Learning in Games

no code implementations NeurIPS 2015 Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, Robert E. Schapire

We show that natural classes of regularized learning algorithms with a form of recency bias achieve faster convergence rates to approximate efficiency and to coarse correlated equilibria in multiplayer normal form games.
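
The "recency bias" in question can be illustrated with Optimistic Hedge, which charges the most recent loss vector twice, treating it as a prediction of the next one. A minimal full-information sketch under that reading (illustrative only, not the paper's exact algorithm or rate analysis):

```python
import numpy as np

def optimistic_hedge(losses, eta):
    """Hedge with recency bias: losses is an illustrative (T, d) array of
    full-information loss vectors; returns the sequence of plays."""
    T, d = losses.shape
    L = np.zeros(d)                     # cumulative losses
    hint = np.zeros(d)                  # last observed loss vector
    plays = []
    for t in range(T):
        w = np.exp(-eta * (L + hint))   # most recent loss counted twice
        plays.append(w / w.sum())
        L += losses[t]
        hint = losses[t]
    return np.array(plays)
```

When every player runs such a recency-biased update, each player's regret grows far more slowly than the adversarial sqrt(T) rate, which is the fast-convergence phenomenon the abstract describes.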

Variance-Reduced and Projection-Free Stochastic Optimization

no code implementations 5 Feb 2016 Elad Hazan, Haipeng Luo

The Frank-Wolfe optimization algorithm has recently regained popularity for machine learning applications due to its projection-free property and its ability to handle structured constraints.

Stochastic Optimization
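
For reference, the classic Frank-Wolfe step behind the projection-free property, specialized to the probability simplex where the linear minimization oracle is just an argmin over vertices. This sketches the deterministic baseline only; the paper's variance-reduced stochastic variants are not reproduced here.

```python
import numpy as np

def frank_wolfe_simplex(grad, d, T):
    """Minimize a smooth convex function over the probability simplex
    using linear minimization oracle calls instead of projections."""
    x = np.ones(d) / d
    for t in range(1, T + 1):
        g = grad(x)
        v = np.zeros(d)
        v[np.argmin(g)] = 1.0        # oracle: best vertex of the simplex
        gamma = 2.0 / (t + 1)        # standard step-size schedule
        x = (1 - gamma) * x + gamma * v
    return x

# e.g., squared distance to a target point y on the simplex:
# x = frank_wolfe_simplex(lambda x: 2 * (x - y), d=len(y), T=200)
```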

Efficient Second Order Online Learning by Sketching

no code implementations NeurIPS 2016 Haipeng Luo, Alekh Agarwal, Nicolo Cesa-Bianchi, John Langford

We propose Sketched Online Newton (SON), an online second order learning algorithm that enjoys substantially improved regret guarantees for ill-conditioned data.
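
For intuition, the full-matrix Online Newton Step that SON makes scalable: each gradient step is preconditioned by the inverse of a running second-moment matrix. The unprojected sketch below is the expensive baseline; SON's contribution, replacing the d-by-d matrix with a low-rank sketch, is not reproduced here.

```python
import numpy as np

def online_newton_step(grads, d, gamma=1.0, eps=1.0):
    """Unprojected Online Newton Step over a stream of gradient vectors."""
    A = eps * np.eye(d)                        # running second-moment matrix
    x = np.zeros(d)
    for g in grads:
        A += np.outer(g, g)                    # rank-one curvature update
        x = x - np.linalg.solve(A, g) / gamma  # preconditioned step
    return x
```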

Oracle-Efficient Online Learning and Auction Design

no code implementations 5 Nov 2016 Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, Jennifer Wortman Vaughan

We consider the design of computationally efficient online learning algorithms in an adversarial setting in which the learner has access to an offline optimization oracle.

Corralling a Band of Bandit Algorithms

1 code implementation 19 Dec 2016 Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, Robert E. Schapire

We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm if it were to be run on its own.

Multi-Armed Bandits

Efficient Contextual Bandits in Non-stationary Worlds

no code implementations 5 Aug 2017 Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, John Langford

In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests.

Multi-Armed Bandits

More Adaptive Algorithms for Adversarial Bandits

no code implementations 10 Jan 2018 Chen-Yu Wei, Haipeng Luo

We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem).
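
As a point of reference, the classic Exp3 template for adversarial multi-armed bandits with the standard fixed, horizon-tuned learning rate; the paper's algorithm is more adaptive, and this sketch only shows the baseline it improves on. The loss interface is an assumption for illustration.

```python
import numpy as np

def exp3(loss_fn, K, T, seed=0):
    """Exp3: exponential weights over importance-weighted loss estimates.
    loss_fn(t, arm) returns the loss in [0, 1] of the arm actually played."""
    rng = np.random.default_rng(seed)
    eta = np.sqrt(np.log(K) / (K * T))     # standard fixed tuning
    L = np.zeros(K)                        # cumulative loss estimates
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (L - L.min()))   # shift for numerical stability
        p = w / w.sum()
        arm = rng.choice(K, p=p)
        loss = loss_fn(t, arm)             # only this arm's loss is observed
        total += loss
        L[arm] += loss / p[arm]            # unbiased importance weighting
    return total
```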

Practical Contextual Bandits with Regression Oracles

no code implementations ICML 2018 Dylan J. Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, Robert E. Schapire

A major challenge in contextual bandits is to design general-purpose algorithms that are both practically useful and theoretically well-founded.

General Classification · Multi-Armed Bandits · +1

Logistic Regression: The Importance of Being Improper

no code implementations 25 Mar 2018 Dylan J. Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, Karthik Sridharan

Starting with the simple observation that the logistic loss is $1$-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm.

regression

Efficient Online Portfolio with Logarithmic Regret

no code implementations NeurIPS 2018 Haipeng Luo, Chen-Yu Wei, Kai Zheng

We study the decades-old problem of online portfolio management and propose the first algorithm with logarithmic regret that is not based on Cover's Universal Portfolio algorithm and admits much faster implementation.

Management

Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously

no code implementations 25 Jan 2019 Julian Zimmert, Haipeng Luo, Chen-Yu Wei

We develop the first general semi-bandit algorithm that simultaneously achieves $\mathcal{O}(\log T)$ regret for stochastic environments and $\mathcal{O}(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$.

Improved Path-length Regret Bounds for Bandits

no code implementations 29 Jan 2019 Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, Chen-Yu Wei

We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandit and more generally linear bandit.

A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free

no code implementations 3 Feb 2019 Yifang Chen, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei

We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret.

Multi-Armed Bandits

Hypothesis Set Stability and Generalization

no code implementations NeurIPS 2019 Dylan J. Foster, Spencer Greenberg, Satyen Kale, Haipeng Luo, Mehryar Mohri, Karthik Sridharan

Our main result is a generalization bound for data-dependent hypothesis sets expressed in terms of a notion of hypothesis set stability and a notion of Rademacher complexity for data-dependent hypothesis sets that we introduce.

Equipping Experts/Bandits with Long-term Memory

no code implementations NeurIPS 2019 Kai Zheng, Haipeng Luo, Ilias Diakonikolas, Li-Wei Wang

We propose the first reduction-based approach to obtaining long-term memory guarantees for online learning in the sense of Bousquet and Warmuth, 2002, by reducing the problem to achieving typical switching regret.

Multi-Armed Bandits

Model selection for contextual bandits

1 code implementation NeurIPS 2019 Dylan J. Foster, Akshay Krishnamurthy, Haipeng Luo

We work in the stochastic realizable setting with a sequence of nested linear policy classes of dimension $d_1 < d_2 < \ldots$, where the $m^\star$-th class contains the optimal policy, and we design an algorithm that achieves $\tilde{O}(T^{2/3}d^{1/3}_{m^\star})$ regret with no prior knowledge of the optimal dimension $d_{m^\star}$.

Model Selection · Multi-Armed Bandits

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

no code implementations 3 Dec 2019 Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses.

Fair Contextual Multi-Armed Bandits: Theory and Experiments

no code implementations 13 Dec 2019 Yifang Chen, Alex Cuellar, Haipeng Luo, Jignesh Modi, Heramb Nemlekar, Stefanos Nikolaidis

We introduce a Multi-Armed Bandit algorithm with fairness constraints, where fairness is defined as a minimum rate that a task or a resource is assigned to a user.

Decision Making · Fairness · +1

A Closer Look at Small-loss Bounds for Bandits with Graph Feedback

no code implementations 2 Feb 2020 Chung-Wei Lee, Haipeng Luo, Mengxiao Zhang

We study small-loss bounds for adversarial multi-armed bandits with graph feedback, that is, adaptive regret bounds that depend on the loss of the best arm or related quantities, instead of the total number of rounds.

Multi-Armed Bandits

Taking a hint: How to leverage loss predictors in contextual bandits?

no code implementations 4 Mar 2020 Chen-Yu Wei, Haipeng Luo, Alekh Agarwal

We initiate the study of learning in contextual bandits with the help of loss predictors.

Multi-Armed Bandits

Adversarial Online Learning with Changing Action Sets: Efficient Algorithms with Approximate Regret Bounds

no code implementations 7 Mar 2020 Ehsan Emamjomeh-Zadeh, Chen-Yu Wei, Haipeng Luo, David Kempe

We revisit the problem of online learning with sleeping experts/bandits: in each time step, only a subset of the actions are available for the algorithm to choose from (and learn about).

PAC learning

A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret

no code implementations 8 Jun 2020 Mehdi Jafarnia-Jahromi, Chen-Yu Wei, Rahul Jain, Haipeng Luo

Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation.

Q-Learning · reinforcement-learning · +1

Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

no code implementations NeurIPS 2020 Tiancheng Jin, Haipeng Luo

This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback.

Multi-Armed Bandits

Active Online Domain Adaptation

no code implementations ICML Workshop LifelongML 2020 Yining Chen, Haipeng Luo, Tengyu Ma, Chicheng Zhang

We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains.

Online Domain Adaptation · regression

Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs

no code implementations NeurIPS 2020 Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang

We develop a new approach to obtaining high probability regret bounds for online learning with bandit feedback against an adaptive adversary.

Linear Last-iterate Convergence in Constrained Saddle-point Optimization

1 code implementation ICLR 2021 Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo

Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achieved with a learning rate whose value is set to a universal constant, improving the result of (Daskalakis & Panageas, 2019b) under the same assumption.
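
OMWU is the Optimistic Multiplicative Weights Update; an equivalent one-line form multiplies the current iterate by exp(-eta * (2 g_t - g_{t-1})), i.e., ordinary MWU with the latest gradient counted twice. A minimal sketch on a bilinear game over simplices, with an illustrative matrix and a small constant learning rate in the spirit of the abstract:

```python
import numpy as np

def omwu_bilinear(A, T, eta=0.1):
    """Optimistic MWU for min_x max_y x^T A y over probability simplices."""
    m, n = A.shape
    x, y = np.ones(m) / m, np.ones(n) / n
    gx_prev, gy_prev = np.zeros(m), np.zeros(n)
    for _ in range(T):
        gx, gy = A @ y, -A.T @ x     # losses for the min and max players
        x = x * np.exp(-eta * (2 * gx - gx_prev)); x /= x.sum()
        y = y * np.exp(-eta * (2 * gy - gy_prev)); y /= y.sum()
        gx_prev, gy_prev = gx, gy
    return x, y

# e.g. matching pennies: omwu_bilinear(np.array([[1., -1.], [-1., 1.]]), 500)
```

Per the abstract, when the equilibrium is unique the last iterate (x, y) itself converges linearly; no averaging is needed.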

Open Problem: Model Selection for Contextual Bandits

no code implementations 19 Jun 2020 Dylan J. Foster, Akshay Krishnamurthy, Haipeng Luo

In statistical learning, algorithms for model selection allow the learner to adapt to the complexity of the best hypothesis class in a sequence.

Model Selection · Multi-Armed Bandits

Active Online Learning with Hidden Shifting Domains

no code implementations 25 Jun 2020 Yining Chen, Haipeng Luo, Tengyu Ma, Chicheng Zhang

We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains.

Domain Adaptation · regression

Comparator-adaptive Convex Bandits

no code implementations NeurIPS 2020 Dirk van der Hoeven, Ashok Cutkosky, Haipeng Luo

We study bandit convex optimization methods that adapt to the norm of the comparator, a topic that has only been studied before for its full-information counterpart.

Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

no code implementations 23 Jul 2020 Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain

We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation.

Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

no code implementations 7 Dec 2020 Liyu Chen, Haipeng Luo, Chen-Yu Wei

We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$ for the full-information setting and the bandit feedback setting respectively, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes.

Impossible Tuning Made Possible: A New Expert Algorithm and Its Applications

no code implementations 1 Feb 2021 Liyu Chen, Haipeng Luo, Chen-Yu Wei

We resolve the long-standing "impossible tuning" issue for the classic expert problem and show that it is in fact possible to achieve regret $O\left(\sqrt{(\ln d)\sum_t \ell_{t, i}^2}\right)$ simultaneously for every expert $i$ in a $T$-round $d$-expert problem, where $\ell_{t, i}$ is the loss of expert $i$ in round $t$.

Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games

no code implementations 8 Feb 2021 Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo

We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play.

Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach

no code implementations 10 Feb 2021 Chen-Yu Wei, Haipeng Luo

Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{1/3}T^{2/3}\})$ where $T$ is the number of rounds and $L$ and $\Delta$ are the number and amount of changes of the world respectively, while previous works only obtain suboptimal bounds and/or require the knowledge of $L$ and $\Delta$.

Multi-Armed Bandits · reinforcement-learning · +1

Finding the Stochastic Shortest Path with Low Regret: The Adversarial Cost and Unknown Transition Case

no code implementations 10 Feb 2021 Liyu Chen, Haipeng Luo

Our work strictly improves (Rosenberg and Mansour, 2020) in the full information setting, extends (Chen et al., 2020) from known transition to unknown transition, and is also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.

The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition

no code implementations NeurIPS 2021 Tiancheng Jin, Longbo Huang, Haipeng Luo

We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes, with the goal of achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the losses are adversarial and simultaneously $\mathcal{O}(\text{polylog}(T))$ regret when the losses are (almost) stochastic.

Open-Ended Question Answering

Online Learning for Stochastic Shortest Path Model via Posterior Sampling

no code implementations 9 Jun 2021 Mehdi Jafarnia-Jahromi, Liyu Chen, Rahul Jain, Haipeng Luo

We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state.

reinforcement-learning · Reinforcement Learning (RL)

Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path

no code implementations NeurIPS 2021 Liyu Chen, Mehdi Jafarnia-Jahromi, Rahul Jain, Haipeng Luo

We introduce a generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model, which achieves minimax optimal regret as long as certain properties are ensured.

Last-iterate Convergence in Extensive-Form Games

no code implementations NeurIPS 2021 Chung-Wei Lee, Christian Kroer, Haipeng Luo

Inspired by recent advances on last-iterate convergence of optimistic algorithms in zero-sum normal-form games, we study this phenomenon in sequential games, and provide a comprehensive study of last-iterate convergence for zero-sum extensive-form games with perfect recall (EFGs), using various optimistic regret-minimization algorithms over treeplexes.

counterfactual

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

no code implementations NeurIPS 2021 Haipeng Luo, Chen-Yu Wei, Chung-Wei Lee

When a simulator is unavailable, we further consider a linear MDP setting and obtain $\widetilde{\mathcal{O}}({T}^{14/15})$ regret, which is the first result for linear MDPs with adversarial losses and bandit feedback.

No-Regret Learning in Time-Varying Zero-Sum Games

no code implementations 30 Jan 2022 Mengxiao Zhang, Peng Zhao, Haipeng Luo, Zhi-Hua Zhou

Learning from repeated play in a fixed two-player zero-sum game is a classic problem in game theory and online learning.

Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints

no code implementations 31 Jan 2022 Liyu Chen, Rahul Jain, Haipeng Luo

We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints.

Kernelized Multiplicative Weights for 0/1-Polyhedral Games: Bridging the Gap Between Learning in Extensive-Form and Normal-Form Games

no code implementations 1 Feb 2022 Gabriele Farina, Chung-Wei Lee, Haipeng Luo, Christian Kroer

In this paper we show that the Optimistic Multiplicative Weights Update (OMWU) algorithm -- the premier learning algorithm for NFGs -- can be simulated on the normal-form equivalent of an EFG in linear time per iteration in the game tree size using a kernel trick.

Policy Optimization for Stochastic Shortest Path

no code implementations 7 Feb 2022 Liyu Chen, Haipeng Luo, Aviv Rosenberg

Policy optimization is among the most popular and successful reinforcement learning algorithms, and there is increasing interest in understanding its theoretical guarantees.

reinforcement-learning · Reinforcement Learning (RL)

Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits

no code implementations 12 Feb 2022 Haipeng Luo, Mengxiao Zhang, Peng Zhao, Zhi-Hua Zhou

The CORRAL algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon.

Adaptive Bandit Convex Optimization with Heterogeneous Curvature

no code implementations 12 Feb 2022 Haipeng Luo, Mengxiao Zhang, Peng Zhao

We consider the problem of adversarial bandit convex optimization, that is, online learning over a sequence of arbitrary convex loss functions with only one function evaluation for each of them.

Uncoupled Learning Dynamics with $O(\log T)$ Swap Regret in Multiplayer Games

no code implementations 25 Apr 2022 Ioannis Anagnostides, Gabriele Farina, Christian Kroer, Chung-Wei Lee, Haipeng Luo, Tuomas Sandholm

In this paper we establish efficient and \emph{uncoupled} learning dynamics so that, when employed by all players in a general-sum multiplayer game, the \emph{swap regret} of each player after $T$ repetitions of the game is bounded by $O(\log T)$, improving over the prior best bounds of $O(\log^4 (T))$.

Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments

no code implementations 25 May 2022 Liyu Chen, Haipeng Luo

We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions.

reinforcement-learning · Reinforcement Learning (RL)

Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback

no code implementations 26 May 2022 Yan Dai, Haipeng Luo, Liyu Chen

More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022).

Near-Optimal No-Regret Learning Dynamics for General Convex Games

no code implementations 17 Jun 2022 Gabriele Farina, Ioannis Anagnostides, Haipeng Luo, Chung-Wei Lee, Christian Kroer, Tuomas Sandholm

In this paper, we answer this in the positive by establishing the first uncoupled learning algorithm with $O(\log T)$ per-player regret in general \emph{convex games}, that is, games with concave utility functions supported on arbitrary convex and compact strategy sets.

Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs

no code implementations 4 Oct 2022 Haipeng Luo, Hanghang Tong, Mengxiao Zhang, Yuheng Zhang

For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$.

Multi-Armed Bandits

No-Regret Learning in Two-Echelon Supply Chain with Unknown Demand Distribution

no code implementations 23 Oct 2022 Mengxiao Zhang, Shi Chen, Haipeng Luo, Yingfei Wang

Supply chain management (SCM) has been recognized as an important discipline with applications to many industries, where the two-echelon stochastic inventory model, involving one downstream retailer and one upstream supplier, plays a fundamental role for developing firms' SCM strategies.

Management

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

4 code implementations CVPR 2023 Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang

Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.

Data Augmentation · Retrieval · +2

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

5 code implementations CVPR 2023 Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition.

Action Classification · Action Recognition · +3

Refined Regret for Adversarial MDPs with Linear Function Approximation

no code implementations 30 Jan 2023 Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert

This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest.

Average-Constrained Policy Optimization

no code implementations 2 Feb 2023 Akhil Agnihotri, Rahul Jain, Haipeng Luo

In this paper, we introduce a new policy optimization with function approximation algorithm for constrained MDPs with the average criterion.

Reinforcement Learning (RL)

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

1 code implementation 18 Aug 2023 Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Dongmei Zhang

Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model.

Ranked #49 on Arithmetic Reasoning on GSM8K (using extra training data)

Arithmetic Reasoning · GSM8K · +2

Online Learning in Contextual Second-Price Pay-Per-Click Auctions

no code implementations 8 Oct 2023 Mengxiao Zhang, Haipeng Luo

We study online learning in contextual pay-per-click auctions where, at each of the $T$ rounds, the learner receives some context along with a set of ads and needs to estimate their click-through rates (CTRs) in order to run a second-price pay-per-click auction.
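
To unpack the setting: in a second-price pay-per-click auction, ads are typically ranked by estimated CTR times bid, and the winner pays, per click, the smallest bid that would still have won. The sketch below is a generic textbook illustration of that rule, not the paper's mechanism or learning algorithm.

```python
def second_price_ppc(bids, ctrs):
    """Rank ads by CTR-weighted bid; return the winner and its per-click
    payment (the runner-up's score divided by the winner's CTR)."""
    scores = [b * c for b, c in zip(bids, ctrs)]
    order = sorted(range(len(bids)), key=lambda i: scores[i], reverse=True)
    winner, runner_up = order[0], order[1]
    payment = scores[runner_up] / ctrs[winner]   # per-click charge
    return winner, payment

# e.g. second_price_ppc([2.0, 1.5, 1.0], [0.10, 0.20, 0.05]) -> (1, 1.0)
```

The paper's online learning problem is then to form the CTR estimates that feed such an auction using only click feedback.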

Last-Iterate Convergence Properties of Regret-Matching Algorithms in Games

no code implementations 1 Nov 2023 Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo, Weiqiang Zheng

Algorithms based on regret matching, specifically regret matching$^+$ (RM$^+$), and its variants are the most popular approaches for solving large-scale two-player zero-sum games in practice.

Near-Optimal Policy Optimization for Correlated Equilibrium in General-Sum Markov Games

no code implementations 26 Jan 2024 Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng

In this paper, we improve both results significantly by providing an uncoupled policy optimization algorithm that attains a near-optimal $\tilde{O}(T^{-1})$ convergence rate for computing a correlated equilibrium.

Contextual Multinomial Logit Bandits with General Value Functions

no code implementations 12 Feb 2024 Mengxiao Zhang, Haipeng Luo

Contextual multinomial logit (MNL) bandits capture many real-world assortment recommendation problems such as online retailing/advertising.

Computational Efficiency · Multi-Armed Bandits

Efficient Contextual Bandits with Uninformed Feedback Graphs

no code implementations 12 Feb 2024 Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro

Bandits with feedback graphs are powerful online learning models that interpolate between the full information and classic bandit problems, capturing many real-life applications.

Multi-Armed Bandits · regression

Tractable Local Equilibria in Non-Concave Games

no code implementations 13 Mar 2024 Yang Cai, Constantinos Daskalakis, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng

While Online Gradient Descent and other no-regret learning procedures are known to efficiently converge to coarse correlated equilibrium in games where each agent's utility is concave in their own strategy, this is not the case when the utilities are non-concave, a situation that is common in machine learning applications where the agents' strategies are parameterized by deep neural networks, or the agents' utilities are computed by a neural network, or both.

Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

no code implementations ICML 2020 Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses.
