Search Results for author: Chen-Yu Wei

Found 40 papers, 3 papers with code

Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback

no code implementations • 11 Nov 2024 • Haolin Liu, Zakaria Mhammedi, Chen-Yu Wei, Julian Zimmert

First, we improve the $\text{poly}(d, A, H)\,T^{5/6}$ regret bound of Zhao et al. (2024) to $\text{poly}(d, A, H)\,T^{2/3}$ for the full-information unknown transition setting, where $d$ is the rank of the transitions, $A$ is the number of actions, $H$ is the horizon length, and $T$ is the number of episodes.

How Does Variance Shape the Regret in Contextual Bandits?

no code implementations • 16 Oct 2024 • Zeyu Jia, Jian Qian, Alexander Rakhlin, Chen-Yu Wei

We show that a regret of $\Omega(\sqrt{d_\text{elu}\Lambda}+d_\text{elu})$ is unavoidable when $\sqrt{d_\text{elu}\Lambda}+d_\text{elu}\leq\sqrt{AT}$.

Multi-Armed Bandits

Corruption-Robust Linear Bandits: Minimax Optimality and Gap-Dependent Misspecification

no code implementations • 10 Oct 2024 • Haolin Liu, Artin Tajdini, Andrew Wagenmaker, Chen-Yu Wei

In this work, we compare two types of corruption commonly considered: strong corruption, where the corruption level depends on the action chosen by the learner, and weak corruption, where it does not.
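
For reference, one common way to formalize the two budgets (notation ours, not necessarily the paper's): with per-round corruptions $c_t(\cdot)$ and the learner's action $a_t$, the strong budget charges only the corruption on the played action, while the weak budget must cover every action in every round:

```latex
% One common formalization (notation ours): the strong budget counts only
% the corruption on the action actually played; the weak budget counts the
% worst case over all actions in each round.
C_{\mathrm{strong}} = \sum_{t=1}^{T} \bigl|c_t(a_t)\bigr|,
\qquad
C_{\mathrm{weak}} = \sum_{t=1}^{T} \max_{a \in \mathcal{A}} \bigl|c_t(a)\bigr|.
```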

Offline Reinforcement Learning: Role of State Aggregation and Trajectory Data

no code implementations • 25 Mar 2024 • Zeyu Jia, Alexander Rakhlin, Ayush Sekhari, Chen-Yu Wei

We revisit the problem of offline reinforcement learning with value function realizability but without Bellman completeness.

reinforcement-learning

On Tractable $\Phi$-Equilibria in Non-Concave Games

no code implementations • 13 Mar 2024 • Yang Cai, Constantinos Daskalakis, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng

While Online Gradient Descent and other no-regret learning procedures are known to efficiently converge to a coarse correlated equilibrium in games where each agent's utility is concave in their own strategy, this is not the case when utilities are non-concave -- a common scenario in machine learning applications involving strategies parameterized by deep neural networks, or when agents' utilities are computed by neural networks, or both.
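
As a reference point for the concave baseline mentioned here, a minimal sketch (ours, not the paper's algorithm; the matrix, step size, and horizon are illustrative) of projected online gradient descent/ascent in a bilinear zero-sum game, where each player's utility is linear, hence concave, in its own mixed strategy, and the averaged play approaches equilibrium:

```python
# Minimal sketch (ours): two players run projected online gradient
# descent/ascent in a bilinear zero-sum game; time-averaged play
# approaches equilibrium, as measured by the duality gap.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / idx[rho], 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))      # x receives x^T A y, y receives -x^T A y
x = np.ones(3) / 3
y = np.ones(3) / 3
eta, T = 0.05, 5000                  # illustrative step size and horizon
x_bar, y_bar = np.zeros(3), np.zeros(3)
for _ in range(T):
    x, y = (project_simplex(x + eta * (A @ y)),    # ascent on own utility
            project_simplex(y - eta * (A.T @ x)))
    x_bar += x / T
    y_bar += y / T
print("duality gap:", (A @ y_bar).max() - (x_bar @ A).min())  # -> 0
```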

Near-Optimal Policy Optimization for Correlated Equilibrium in General-Sum Markov Games

no code implementations • 26 Jan 2024 • Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng

In this paper, we improve both results significantly by providing an uncoupled policy optimization algorithm that attains a near-optimal $\tilde{O}(T^{-1})$ convergence rate for computing a correlated equilibrium.

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

no code implementations • 17 Oct 2023 • Haolin Liu, Chen-Yu Wei, Julian Zimmert

The first algorithm, although computationally inefficient, ensures a regret of $\widetilde{\mathcal{O}}\left(\sqrt{K}\right)$, where $K$ is the number of episodes.

Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs

no code implementations • NeurIPS 2023 • Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, Alejandro Ribeiro

To fill this gap, we employ the Lagrangian method to cast a constrained MDP into a constrained saddle-point problem in which max/min players correspond to primal/dual variables, respectively, and develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
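
To make the saddle-point reduction concrete, a toy sketch (ours; the paper's algorithms operate on policy iterates in a constrained MDP) of single-time-scale primal-dual updates on a scalar constrained problem, where the primal and dual variables share one step size:

```python
# Toy sketch (ours, not the paper's CMDP algorithm): maximize a concave
# f(x) subject to g(x) >= 0 by simultaneous gradient ascent on x and
# projected descent on the Lagrange multiplier, with one shared step size.
f  = lambda x: -(x - 2.0) ** 2        # concave objective, unconstrained max at x = 2
g  = lambda x: 1.0 - x                # constraint g(x) >= 0, i.e. x <= 1
df = lambda x: -2.0 * (x - 2.0)
dg = lambda x: -1.0

x, lam, eta = 0.0, 0.0, 0.01          # primal var, dual var, shared step size
for _ in range(20000):
    x, lam = (x + eta * (df(x) + lam * dg(x)),     # ascent on L(x, lam)
              max(0.0, lam - eta * g(x)))          # descent on L, keep lam >= 0
print(x, lam)                         # -> (1.0, 2.0): constraint active, KKT holds
```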

A Blackbox Approach to Best of Both Worlds in Bandits and Beyond

no code implementations • 20 Feb 2023 • Christoph Dann, Chen-Yu Wei, Julian Zimmert

Best-of-both-worlds algorithms for online learning which achieve near-optimal regret in both the adversarial and the stochastic regimes have received growing attention recently.

Multi-Armed Bandits

Best of Both Worlds Policy Optimization

no code implementations • 18 Feb 2023 • Christoph Dann, Chen-Yu Wei, Julian Zimmert

Then we show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log-barrier regularizer.

Refined Regret for Adversarial MDPs with Linear Function Approximation

no code implementations • 30 Jan 2023 • Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert

This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest.

A Unified Algorithm for Stochastic Path Problems

no code implementations • 17 Oct 2022 • Christoph Dann, Chen-Yu Wei, Julian Zimmert

Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP) with all non-positive rewards.

Independent Policy Gradient for Large-Scale Markov Potential Games: Sharper Rates, Function Approximation, and Game-Agnostic Convergence

no code implementations • 8 Feb 2022 • Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, Mihailo R. Jovanović

When there is no uncertainty in the gradient evaluation, we show that our algorithm finds an $\epsilon$-Nash equilibrium with $O(1/\epsilon^2)$ iteration complexity which does not explicitly depend on the state space size.
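
As a toy illustration of the independent-learning protocol (ours; a softmax parameterization on a one-state identical-interest game, a special case of a potential game, rather than the paper's Markov setting), each agent ascends the exact gradient of its own expected payoff with no coordination:

```python
# Toy sketch (ours): independent softmax policy gradient in a 2x2
# identical-interest game (a potential game). Each agent uses only the
# gradient of its own expected payoff.
import numpy as np

U = np.array([[1.0, 0.0],
              [0.0, 2.0]])            # shared payoff: both receive U[a1, a2]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

th1, th2, eta = np.zeros(2), np.zeros(2), 0.5   # logits and step size
for _ in range(2000):
    p1, p2 = softmax(th1), softmax(th2)
    q1, q2 = U @ p2, U.T @ p1         # each agent's payoff per own action
    th1 += eta * p1 * (q1 - p1 @ q1)  # exact softmax policy gradient
    th2 += eta * p2 * (q2 - p2 @ q2)
print(softmax(th1), softmax(th2))     # both concentrate on the payoff-2 action
```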

Multi-agent Reinforcement Learning, Policy Gradient Methods +1

Decentralized Cooperative Reinforcement Learning with Hierarchical Information Structure

no code implementations • 1 Nov 2021 • Hsu Kao, Chen-Yu Wei, Vijay Subramanian

For the bandit setting, we propose a hierarchical bandit algorithm that achieves a near-optimal gap-independent regret of $\widetilde{\mathcal{O}}(\sqrt{ABT})$ and a near-optimal gap-dependent regret of $\mathcal{O}(\log(T))$, where $A$ and $B$ are the numbers of actions of the leader and the follower, respectively, and $T$ is the number of steps.

Multi-agent Reinforcement Learning, Multi-Armed Bandits +3

A Model Selection Approach for Corruption Robust Reinforcement Learning

no code implementations • 7 Oct 2021 • Chen-Yu Wei, Christoph Dann, Julian Zimmert

We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transition and reward.

Model Selection, Multi-Armed Bandits +4

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

no code implementations • NeurIPS 2021 • Haipeng Luo, Chen-Yu Wei, Chung-Wei Lee

When a simulator is unavailable, we further consider a linear MDP setting and obtain $\widetilde{\mathcal{O}}({T}^{14/15})$ regret, which is the first result for linear MDPs with adversarial losses and bandit feedback.

Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach

no code implementations • 10 Feb 2021 • Chen-Yu Wei, Haipeng Luo

Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{1/3}T^{2/3}\})$ where $T$ is the number of rounds and $L$ and $\Delta$ are the number and amount of changes of the world respectively, while previous works only obtain suboptimal bounds and/or require the knowledge of $L$ and $\Delta$.

Multi-Armed Bandits, reinforcement-learning +1

Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games

no code implementations • 8 Feb 2021 • Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo

We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play.
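
A minimal unconstrained illustration of the last-iterate phenomenon (ours; the paper treats infinite-horizon discounted Markov games): on the bilinear saddle point $f(x, y) = xy$, plain gradient descent/ascent spirals outward, while optimistic GDA converges in the last iterate:

```python
# Minimal sketch (ours): on f(x, y) = x * y, simultaneous GDA diverges,
# while optimistic GDA (a gradient step extrapolated with the previous
# gradient) converges in the last iterate to the equilibrium (0, 0).
eta = 0.1

def gda(x, y, steps=500):
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x
    return x, y

def ogda(x, y, steps=500):
    gx_prev, gy_prev = y, x              # previous gradients
    for _ in range(steps):
        gx, gy = y, x                    # grad_x f = y, grad_y f = x
        x -= eta * (2 * gx - gx_prev)    # extrapolated (optimistic) step
        y += eta * (2 * gy - gy_prev)
        gx_prev, gy_prev = gx, gy
    return x, y

print("GDA :", gda(1.0, 1.0))            # spirals away from (0, 0)
print("OGDA:", ogda(1.0, 1.0))           # -> (0, 0)
```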

Impossible Tuning Made Possible: A New Expert Algorithm and Its Applications

no code implementations • 1 Feb 2021 • Liyu Chen, Haipeng Luo, Chen-Yu Wei

We resolve the long-standing "impossible tuning" issue for the classic expert problem and show that it is in fact possible to achieve regret $O\left(\sqrt{(\ln d)\sum_t \ell_{t, i}^2}\right)$ simultaneously for all experts $i$ in a $T$-round $d$-expert problem, where $\ell_{t, i}$ is the loss of expert $i$ in round $t$.
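
For a feel of the quantity involved, a sketch (ours) that runs plain Hedge with one fixed learning rate and compares each expert's realized regret to the per-expert target $\sqrt{(\ln d)\sum_t \ell_{t,i}^2}$; a single fixed rate matches the target only for some experts, which is exactly the tuning obstruction the paper removes:

```python
# Sketch (ours): plain Hedge with one fixed learning rate, compared
# against the per-expert target sqrt(ln(d) * sum_t l_{t,i}^2). Against
# the low-loss expert, Hedge's regret far exceeds its target, which is
# the "impossible tuning" issue.
import numpy as np

rng = np.random.default_rng(0)
T, d = 10000, 8
losses = rng.uniform(0, 1, size=(T, d))
losses[:, 0] *= 0.05                       # expert 0 has tiny losses

eta = np.sqrt(np.log(d) / T)               # standard fixed tuning
w = np.zeros(d)                            # negative scaled cumulative losses
alg_loss = 0.0
for t in range(T):
    p = np.exp(w - w.max()); p /= p.sum()
    alg_loss += p @ losses[t]
    w -= eta * losses[t]

regret = alg_loss - losses.sum(axis=0)     # regret against each expert i
target = np.sqrt(np.log(d) * (losses ** 2).sum(axis=0))
print(np.round(regret, 1))                 # large vs. expert 0
print(np.round(target, 1))                 # the bound the paper gets for all i
```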

Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

no code implementations • 7 Dec 2020 • Liyu Chen, Haipeng Luo, Chen-Yu Wei

We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$ for the full-information setting and the bandit feedback setting respectively, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes.

Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

no code implementations • 23 Jul 2020 • Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain

We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation.

Linear Last-iterate Convergence in Constrained Saddle-point Optimization

1 code implementation • ICLR 2021 • Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo

Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achieved with a learning rate whose value is set to a universal constant, improving the result of (Daskalakis & Panageas, 2019b) under the same assumption.
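
A compact OMWU sketch (ours; the learning rate and horizon are illustrative) on rock-paper-scissors, a bilinear game over the simplex with a unique (uniform) equilibrium, showing the last iterate approaching it while vanilla MWU would cycle:

```python
# Sketch (ours): Optimistic Multiplicative Weights Update on rock-paper-
# scissors, whose unique equilibrium is uniform. The optimistic step
# weights today's gradient twice and subtracts yesterday's.
import numpy as np

A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])          # x's payoff: x^T A y
eta = 0.1                              # illustrative learning rate

def normalize(w):
    return w / w.sum()

x = np.array([0.6, 0.3, 0.1])
y = np.array([0.2, 0.5, 0.3])
gx_prev, gy_prev = A @ y, A.T @ x
for _ in range(3000):
    gx, gy = A @ y, A.T @ x            # current (linear) utility gradients
    x = normalize(x * np.exp(eta * (2 * gx - gx_prev)))
    y = normalize(y * np.exp(-eta * (2 * gy - gy_prev)))
    gx_prev, gy_prev = gx, gy
print(x, y)                            # both near (1/3, 1/3, 1/3)
```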

Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs

no code implementations • NeurIPS 2020 • Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang

We develop a new approach to obtaining high probability regret bounds for online learning with bandit feedback against an adaptive adversary.

A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret

no code implementations • 8 Jun 2020 • Mehdi Jafarnia-Jahromi, Chen-Yu Wei, Rahul Jain, Haipeng Luo

Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation.

Q-Learning, reinforcement-learning +1

Federated Residual Learning

no code implementations • 28 Mar 2020 • Alekh Agarwal, John Langford, Chen-Yu Wei

We study a new form of federated learning where the clients train personalized local models and make predictions jointly with the server-side shared model.
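
A small sketch of the prediction structure described here (ours; the linear models and synthetic data are purely illustrative): each client fits a personal model to the residuals of the shared server model and predicts with the sum of the two:

```python
# Sketch (ours): a shared server model is fit on pooled data; each
# client then fits a small personal model to the *residuals* of the
# server model on its own data, and predicts server(x) + personal(x).
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
w_shared = rng.standard_normal(d)            # common signal
clients = []
for _ in range(3):
    w_local = 0.5 * rng.standard_normal(d)   # client-specific shift
    X = rng.standard_normal((n, d))
    y = X @ (w_shared + w_local) + 0.1 * rng.standard_normal(n)
    clients.append((X, y))

# Server: least squares on the pooled data.
Xp = np.vstack([X for X, _ in clients])
yp = np.concatenate([y for _, y in clients])
w_srv = np.linalg.lstsq(Xp, yp, rcond=None)[0]

# Clients: least squares on their own residuals.
for X, y in clients:
    w_res = np.linalg.lstsq(X, y - X @ w_srv, rcond=None)[0]
    pred = X @ w_srv + X @ w_res             # joint server + residual prediction
    print(f"mse server-only={np.mean((X @ w_srv - y) ** 2):.3f}  "
          f"with residual={np.mean((pred - y) ** 2):.3f}")
```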

Federated Learning

Adversarial Online Learning with Changing Action Sets: Efficient Algorithms with Approximate Regret Bounds

no code implementations • 7 Mar 2020 • Ehsan Emamjomeh-Zadeh, Chen-Yu Wei, Haipeng Luo, David Kempe

We revisit the problem of online learning with sleeping experts/bandits: in each time step, only a subset of the actions are available for the algorithm to choose from (and learn about).

PAC learning

Taking a hint: How to leverage loss predictors in contextual bandits?

no code implementations • 4 Mar 2020 • Chen-Yu Wei, Haipeng Luo, Alekh Agarwal

We initiate the study of learning in contextual bandits with the help of loss predictors.

Multi-Armed Bandits

Analyzing the Variance of Policy Gradient Estimators for the Linear-Quadratic Regulator

no code implementations • 2 Oct 2019 • James A. Preiss, Sébastien M. R. Arnold, Chen-Yu Wei, Marius Kloft

We study the variance of the REINFORCE policy gradient estimator in environments with continuous state and action spaces, linear dynamics, quadratic cost, and Gaussian noise.
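
A tiny numerical sketch of the object under study (ours; a one-step scalar case, whereas the paper's analysis is exact and multivariate): the REINFORCE estimator for a Gaussian linear policy on a quadratic cost, with its empirical mean checked against a finite-difference gradient:

```python
# Sketch (ours): empirical variance of the REINFORCE policy gradient
# estimator for a one-step scalar LQR-style problem, Gaussian policy
# a ~ N(theta * x, sigma^2) and cost (x + a)^2 + 0.1 * a^2.
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 0.3, 0.5, 200_000

x = rng.standard_normal(n)
noise = sigma * rng.standard_normal(n)
a = theta * x + noise
cost = (x + a) ** 2 + 0.1 * a ** 2
score = noise * x / sigma ** 2            # d log pi(a|x) / d theta
g = cost * score                          # per-sample REINFORCE estimate

print("REINFORCE mean    :", g.mean())    # ~ true gradient dJ/dtheta
print("REINFORCE variance:", g.var())     # the quantity the paper studies

eps = 1e-4                                # finite-difference sanity check
J = lambda th: np.mean((x + th * x + noise) ** 2 + 0.1 * (th * x + noise) ** 2)
print("finite difference :", (J(theta + eps) - J(theta - eps)) / (2 * eps))
```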

Bandit Multiclass Linear Classification: Efficient Algorithms for the Separable Case

no code implementations • 6 Feb 2019 • Alina Beygelzimer, Dávid Pál, Balázs Szörényi, Devanathan Thiruvenkatachari, Chen-Yu Wei, Chicheng Zhang

Under the more challenging weak linear separability condition, we design an efficient algorithm with a mistake bound of $\min (2^{\widetilde{O}(K \log^2 (1/\gamma))}, 2^{\widetilde{O}(\sqrt{1/\gamma} \log K)})$.

Classification, General Classification

A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free

no code implementations • 3 Feb 2019 • Yifang Chen, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei

We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret.

Multi-Armed Bandits

Improved Path-length Regret Bounds for Bandits

no code implementations • 29 Jan 2019 • Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, Chen-Yu Wei

We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both the multi-armed bandit and, more generally, the linear bandit setting.

Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously

no code implementations • 25 Jan 2019 • Julian Zimmert, Haipeng Luo, Chen-Yu Wei

We develop the first general semi-bandit algorithm that simultaneously achieves $\mathcal{O}(\log T)$ regret for stochastic environments and $\mathcal{O}(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$.
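
For the plain multi-armed bandit special case, the best-of-both-worlds phenomenon is exhibited by 1/2-Tsallis-entropy regularization (Tsallis-INF, Zimmert & Seldin); a compact sketch (ours; the learning-rate schedule and bisection bounds are illustrative), where the play distribution solves $p_i = (\eta_t(\widehat{L}_i - x))^{-2}$ with the normalizer $x$ found by bisection:

```python
# Sketch (ours): Tsallis-INF (power 1/2) for the multi-armed bandit
# special case of best-of-both-worlds. The play distribution solves
# p_i = 1 / (eta * (Lhat_i - x))^2, with the normalizer x found by
# bisection so that the probabilities sum to 1.
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 20000
means = np.linspace(0.1, 0.6, K)           # unknown Bernoulli loss means
Lhat = np.zeros(K)                         # importance-weighted loss estimates

def tsallis_probs(Lhat, eta):
    m = Lhat.min()
    lo, hi = m - np.sqrt(K) / eta, m - 1.0 / eta   # sum(p) <= 1 at lo, >= 1 at hi
    for _ in range(60):                    # bisection on the normalizer x
        x = (lo + hi) / 2
        lo, hi = (x, hi) if np.sum(1.0 / (eta * (Lhat - x)) ** 2) < 1 else (lo, x)
    p = 1.0 / (eta * (Lhat - x)) ** 2
    return p / p.sum()

total = 0.0
for t in range(1, T + 1):
    eta = 2.0 / np.sqrt(t)                 # illustrative anytime learning rate
    p = tsallis_probs(Lhat, eta)
    arm = rng.choice(K, p=p)
    loss = float(rng.random() < means[arm])
    Lhat[arm] += loss / p[arm]             # unbiased importance-weighted update
    total += loss
print("pseudo-regret ~", total - T * means.min())
```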

Efficient Online Portfolio with Logarithmic Regret

no code implementations • NeurIPS 2018 • Haipeng Luo, Chen-Yu Wei, Kai Zheng

We study the decades-old problem of online portfolio management and propose the first algorithm with logarithmic regret that is not based on Cover's Universal Portfolio algorithm and admits much faster implementation.
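
The benchmark in this problem is the best constant-rebalanced portfolio (CRP) in hindsight; a toy sketch (ours; two assets, a naive uniform strategy, and grid search, not the paper's algorithm) of the protocol and the log-wealth regret that logarithmic-regret guarantees refer to:

```python
# Sketch (ours): the online portfolio protocol and its regret. Wealth
# multiplies by <p_t, r_t> each round; regret is measured in log-wealth
# against the best constant-rebalanced portfolio (CRP) in hindsight.
# The strategy below is a naive uniform CRP, just to show the protocol.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
returns = np.exp(0.01 * rng.standard_normal((T, 2)))   # price relatives

log_wealth = 0.0
for t in range(T):
    p = np.array([0.5, 0.5])            # the learner's (naive) portfolio
    log_wealth += np.log(p @ returns[t])

# Best CRP in hindsight via grid search over the 2-asset simplex.
grid = np.linspace(0.0, 1.0, 1001)
best = max(np.sum(np.log(b * returns[:, 0] + (1 - b) * returns[:, 1]))
           for b in grid)
print("log-wealth regret:", best - log_wealth)
```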

Management

More Adaptive Algorithms for Adversarial Bandits

no code implementations • 10 Jan 2018 • Chen-Yu Wei, Haipeng Luo

We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem).

Tracking the Best Expert in Non-stationary Stochastic Environments

no code implementations • NeurIPS 2016 • Chen-Yu Wei, Yi-Te Hong, Chi-Jen Lu

We study the dynamic regret of the multi-armed bandit and experts problems in non-stationary stochastic environments.

Efficient Contextual Bandits in Non-stationary Worlds

no code implementations • 5 Aug 2017 • Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, John Langford

In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. …

Multi-Armed Bandits
