no code implementations • 2 Sep 2023 • Haolin Liu, Chen-Yu Wei, Julian Zimmert
We consider the adversarial linear contextual bandit problem, where the loss vectors are selected fully adversarially and the per-round action set (i.e., the context) is drawn from a fixed distribution.
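A minimal sketch of this interaction protocol, assuming a toy Gaussian context distribution and an arbitrary time-varying loss vector (the dimensions, distributions, and placeholder policy below are illustrative assumptions, not the paper's construction):

```python
import numpy as np

# Illustrative sketch of the adversarial linear contextual bandit protocol:
# the per-round action set (the context) is drawn i.i.d. from a fixed
# distribution, while the loss vector is chosen adversarially.

rng = np.random.default_rng(0)
d, K, T = 5, 4, 100          # feature dimension, number of actions, rounds

total_loss = 0.0
for t in range(T):
    # Context: K feature vectors drawn from a fixed (here, Gaussian) distribution.
    action_set = rng.normal(size=(K, d))
    # Adversarial loss vector: here just an arbitrary time-varying choice.
    theta_t = np.sin(t + np.arange(d))
    # Learner picks an action (uniformly at random in this placeholder policy).
    a_t = rng.integers(K)
    # Learner observes only the loss of the chosen action (bandit feedback).
    total_loss += action_set[a_t] @ theta_t

print(f"cumulative loss of the placeholder policy: {total_loss:.2f}")
```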
no code implementations • 20 Jun 2023 • Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, Alejandro Ribeiro
To fill this gap, we employ the Lagrangian method to cast a constrained MDP into a constrained saddle-point problem in which max/min players correspond to primal/dual variables, respectively, and develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
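A minimal sketch of the Lagrangian saddle-point idea, shown on a toy convex problem with a single-time-scale gradient ascent/descent loop over the primal and dual variables (this is not the paper's constrained-MDP algorithm; the objective and constraint below are illustrative assumptions):

```python
import numpy as np

# Maximize a reward f(x) subject to g(x) >= 0 by running one ascent/descent
# loop on the Lagrangian L(x, lam) = f(x) + lam * g(x), lam >= 0.

def f(x):            # objective to maximize
    return -(x - 2.0) ** 2

def g(x):            # constraint: require g(x) >= 0, i.e. x <= 1
    return 1.0 - x

x, lam, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    # Primal ascent on L(x, lam): f'(x) + lam * g'(x).
    grad_x = -2.0 * (x - 2.0) + lam * (-1.0)
    x += lr * grad_x
    # Dual descent: lam shrinks when the constraint is slack, grows when violated.
    lam = max(0.0, lam - lr * g(x))

print(f"x ~ {x:.3f}, lam ~ {lam:.3f}")   # approaches the constrained optimum x = 1
```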
no code implementations • 27 May 2023 • Tiancheng Jin, Junyan Liu, Chloé Rouyer, William Chang, Chen-Yu Wei, Haipeng Luo
Existing online learning algorithms for adversarial Markov Decision Processes achieve $O(\sqrt{T})$ regret after $T$ rounds of interaction even if the loss functions are chosen arbitrarily by an adversary, with the caveat that the transition function has to be fixed.
no code implementations • 1 May 2023 • Julia Olkhovskaya, Jack Mayo, Tim van Erven, Gergely Neu, Chen-Yu Wei
We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of $K$ arms to change over time without restriction.
no code implementations • 5 Mar 2023 • Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng
We extend our result to the case of irreducible Markov games, providing a last-iterate convergence rate of $\mathcal{O}(t^{-\frac{1}{9+\varepsilon}})$ for any $\varepsilon>0$.
no code implementations • 20 Feb 2023 • Christoph Dann, Chen-Yu Wei, Julian Zimmert
Best-of-both-worlds algorithms for online learning, which achieve near-optimal regret in both the adversarial and the stochastic regimes, have received growing attention recently.
no code implementations • 18 Feb 2023 • Christoph Dann, Chen-Yu Wei, Julian Zimmert
Then we show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log-barrier regularizer.
no code implementations • 30 Jan 2023 • Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert
This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest.
no code implementations • 17 Oct 2022 • Christoph Dann, Chen-Yu Wei, Julian Zimmert
Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP) with all non-positive rewards.
1 code implementation • 10 Feb 2022 • Alberto Bietti, Chen-Yu Wei, Miroslav Dudík, John Langford, Zhiwei Steven Wu
Large-scale machine learning systems often involve data distributed across a collection of users.
no code implementations • 8 Feb 2022 • Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, Mihailo R. Jovanović
When there is no uncertainty in the gradient evaluation, we show that our algorithm finds an $\epsilon$-Nash equilibrium with $O(1/\epsilon^2)$ iteration complexity which does not explicitly depend on the state space size.
no code implementations • 1 Nov 2021 • Hsu Kao, Chen-Yu Wei, Vijay Subramanian
For the bandit setting, we propose a hierarchical bandit algorithm that achieves a near-optimal gap-independent regret of $\widetilde{\mathcal{O}}(\sqrt{ABT})$ and a near-optimal gap-dependent regret of $\mathcal{O}(\log(T))$, where $A$ and $B$ are the numbers of actions of the leader and the follower, respectively, and $T$ is the number of steps.
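A heavily simplified sketch of a nested leader/follower bandit structure, in which the follower keeps one UCB instance per leader action and the leader runs UCB over its own actions; this illustrates the hierarchy only and is not the paper's algorithm or analysis (the shared-reward model below is a toy assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, T = 3, 4, 5000
means = rng.uniform(size=(A, B))          # hypothetical joint reward means

def ucb_pick(counts, sums, t):
    """Standard UCB index; unseen arms are explored first."""
    values = sums / np.maximum(counts, 1) + np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1))
    values[counts == 0] = np.inf
    return int(np.argmax(values))

lead_counts, lead_sums = np.zeros(A), np.zeros(A)
fol_counts, fol_sums = np.zeros((A, B)), np.zeros((A, B))

total = 0.0
for t in range(T):
    a = ucb_pick(lead_counts, lead_sums, t)        # leader moves first
    b = ucb_pick(fol_counts[a], fol_sums[a], t)    # follower reacts to the leader's action
    r = rng.binomial(1, means[a, b])               # shared Bernoulli reward
    lead_counts[a] += 1; lead_sums[a] += r
    fol_counts[a, b] += 1; fol_sums[a, b] += r
    total += r

print(f"average reward: {total / T:.3f}, best joint mean: {means.max():.3f}")
```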
no code implementations • 7 Oct 2021 • Chen-Yu Wei, Christoph Dann, Julian Zimmert
We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transition and reward.
no code implementations • NeurIPS 2021 • Haipeng Luo, Chen-Yu Wei, Chung-Wei Lee
When a simulator is unavailable, we further consider a linear MDP setting and obtain $\widetilde{\mathcal{O}}({T}^{14/15})$ regret, which is the first result for linear MDPs with adversarial losses and bandit feedback.
no code implementations • 11 Feb 2021 • Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang, Xiaojin Zhang
In this work, we develop linear bandit algorithms that automatically adapt to different environments.
no code implementations • 10 Feb 2021 • Chen-Yu Wei, Haipeng Luo
Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{1/3}T^{2/3}\})$ where $T$ is the number of rounds and $L$ and $\Delta$ are the number and amount of changes of the world respectively, while previous works only obtain suboptimal bounds and/or require the knowledge of $L$ and $\Delta$.
no code implementations • 8 Feb 2021 • Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo
We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play.
no code implementations • 1 Feb 2021 • Liyu Chen, Haipeng Luo, Chen-Yu Wei
We resolve the long-standing "impossible tuning" issue for the classic expert problem and show that it is in fact possible to achieve regret $O\left(\sqrt{(\ln d)\sum_t \ell_{t, i}^2}\right)$ simultaneously for every expert $i$ in a $T$-round $d$-expert problem, where $\ell_{t, i}$ is the loss for expert $i$ in round $t$.
no code implementations • 7 Dec 2020 • Liyu Chen, Haipeng Luo, Chen-Yu Wei
We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$ for the full-information setting and the bandit feedback setting respectively, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes.
no code implementations • 23 Jul 2020 • Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain
We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation.
1 code implementation • ICLR 2021 • Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo
Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achieved with a learning rate whose value is set to a universal constant, improving the result of Daskalakis & Panageas (2019b) under the same assumption.
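A minimal sketch of the OMWU update with a constant learning rate on a bilinear game over the simplex (rock-paper-scissors here; the game matrix, learning rate, starting points, and iteration count are illustrative choices, not taken from the paper):

```python
import numpy as np

# Optimistic Multiplicative Weights Update for the zero-sum game x^T A y
# (x maximizes, y minimizes). The optimistic step uses 2 * current gradient
# minus the previous gradient.

A = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])      # rock-paper-scissors payoff
eta = 0.1

x = np.array([0.6, 0.3, 0.1])
y = np.array([0.2, 0.5, 0.3])
gx_prev, gy_prev = A @ y, A.T @ x      # gradient predictions from the start point

for t in range(5000):
    gx, gy = A @ y, A.T @ x
    x = x * np.exp(eta * (2 * gx - gx_prev))
    x /= x.sum()
    y = y * np.exp(-eta * (2 * gy - gy_prev))
    y /= y.sum()
    gx_prev, gy_prev = gx, gy

print("x:", np.round(x, 3), "y:", np.round(y, 3))   # approaches the unique equilibrium (1/3, 1/3, 1/3)
```

With plain (non-optimistic) multiplicative weights the iterates of this game cycle; the optimistic correction is what makes the last iterate converge.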
no code implementations • NeurIPS 2020 • Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang
We develop a new approach to obtaining high probability regret bounds for online learning with bandit feedback against an adaptive adversary.
no code implementations • 8 Jun 2020 • Mehdi Jafarnia-Jahromi, Chen-Yu Wei, Rahul Jain, Haipeng Luo
Recently, model-free reinforcement learning has attracted research attention due to its simplicity, its memory and computation efficiency, and its flexibility to be combined with function approximation.
no code implementations • 28 Mar 2020 • Alekh Agarwal, John Langford, Chen-Yu Wei
We study a new form of federated learning where the clients train personalized local models and make predictions jointly with the server-side shared model.
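One simple way to realize this shared-plus-personalized structure is a residual-style combination: a server-side linear model fit on pooled data plus a small per-client correction fit on that client's residuals. The sketch below is an illustrative assumption of such a scheme, not the paper's protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, n = 5, 3, 200

# Hypothetical clients sharing a global trend plus client-specific offsets.
w_global = rng.normal(size=d)
client_data = []
for c in range(n_clients):
    X = rng.normal(size=(n, d))
    y = X @ w_global + 0.5 * c + 0.1 * rng.normal(size=n)   # client-specific shift
    client_data.append((X, y))

# Server: least-squares fit on pooled data.
X_all = np.vstack([X for X, _ in client_data])
y_all = np.concatenate([y for _, y in client_data])
w_server, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

# Clients: fit an intercept-only local model on residuals of the shared model.
for c, (X, y) in enumerate(client_data):
    local_bias = np.mean(y - X @ w_server)
    preds = X @ w_server + local_bias        # joint prediction: shared + personal
    print(f"client {c}: residual-model MSE = {np.mean((preds - y) ** 2):.4f}")
```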
no code implementations • 7 Mar 2020 • Ehsan Emamjomeh-Zadeh, Chen-Yu Wei, Haipeng Luo, David Kempe
We revisit the problem of online learning with sleeping experts/bandits: in each time step, only a subset of the actions are available for the algorithm to choose from (and learn about).
no code implementations • 4 Mar 2020 • Chen-Yu Wei, Haipeng Luo, Alekh Agarwal
We initiate the study of learning in contextual bandits with the help of loss predictors.
1 code implementation • ICML 2020 • Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, Rahul Jain
Model-free reinforcement learning is known to be memory- and computation-efficient and more amenable to large-scale problems.
no code implementations • 2 Oct 2019 • James A. Preiss, Sébastien M. R. Arnold, Chen-Yu Wei, Marius Kloft
We study the variance of the REINFORCE policy gradient estimator in environments with continuous state and action spaces, linear dynamics, quadratic cost, and Gaussian noise.
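An illustrative sketch that estimates this variance empirically in a 1-D linear-quadratic problem with a Gaussian policy (the dynamics, costs, and policy parameters below are toy assumptions; the paper studies the variance analytically):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.9, 1.0                # dynamics: x' = a x + b u + noise
q, r = 1.0, 0.1                # quadratic costs on state and action
k, sigma = -0.5, 0.5           # linear Gaussian policy u ~ N(k * x, sigma^2)
H = 20                          # horizon

def reinforce_estimate():
    """One REINFORCE estimate of d/dk of the expected total cost."""
    x, cost, score = 1.0, 0.0, 0.0
    for _ in range(H):
        u = k * x + sigma * rng.normal()
        cost += q * x ** 2 + r * u ** 2
        score += (u - k * x) * x / sigma ** 2     # d/dk log N(u; k x, sigma^2)
        x = a * x + b * u + 0.1 * rng.normal()
    return cost * score

samples = np.array([reinforce_estimate() for _ in range(5000)])
print(f"gradient estimate mean: {samples.mean():.2f}, variance: {samples.var():.2f}")
```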
no code implementations • 6 Feb 2019 • Alina Beygelzimer, Dávid Pál, Balázs Szörényi, Devanathan Thiruvenkatachari, Chen-Yu Wei, Chicheng Zhang
Under the more challenging weak linear separability condition, we design an efficient algorithm with a mistake bound of $\min (2^{\widetilde{O}(K \log^2 (1/\gamma))}, 2^{\widetilde{O}(\sqrt{1/\gamma} \log K)})$.
no code implementations • 3 Feb 2019 • Yifang Chen, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei
We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret.
no code implementations • 29 Jan 2019 • Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, Chen-Yu Wei
We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandit and more generally linear bandit.
no code implementations • 25 Jan 2019 • Julian Zimmert, Haipeng Luo, Chen-Yu Wei
We develop the first general semi-bandit algorithm that simultaneously achieves $\mathcal{O}(\log T)$ regret for stochastic environments and $\mathcal{O}(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$.
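The underlying idea can be illustrated with FTRL over a Tsallis-entropy regularizer in the plain multi-armed bandit case; the sketch below is such an illustration (with an assumed Bernoulli loss environment and learning-rate schedule), not the paper's semi-bandit algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 10000
means = np.array([0.5, 0.45, 0.6, 0.3, 0.55])   # hypothetical stochastic loss means

def tsallis_distribution(L_hat, eta):
    """Solve p_i = 1 / (eta * (L_hat_i + nu))^2 with sum(p) = 1 by bisection over nu."""
    lo = -L_hat.min() + 1e-12
    hi = -L_hat.min() + np.sqrt(K) / eta + 1.0
    for _ in range(100):
        nu = (lo + hi) / 2
        p = 1.0 / (eta * (L_hat + nu)) ** 2
        if p.sum() > 1.0:
            lo = nu
        else:
            hi = nu
    return p / p.sum()

L_hat = np.zeros(K)            # cumulative importance-weighted loss estimates
pulls = np.zeros(K)
for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)
    p = tsallis_distribution(L_hat, eta)
    a = rng.choice(K, p=p)
    loss = float(rng.random() < means[a])        # Bernoulli loss
    L_hat[a] += loss / p[a]                       # importance weighting
    pulls[a] += 1

print("pulls per arm:", pulls.astype(int))        # concentrates on arm 3 (lowest mean)
```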
no code implementations • NeurIPS 2018 • Haipeng Luo, Chen-Yu Wei, Kai Zheng
We study the decades-old problem of online portfolio management and propose the first algorithm with logarithmic regret that is not based on Cover's Universal Portfolio algorithm and admits much faster implementation.
no code implementations • 10 Jan 2018 • Chen-Yu Wei, Haipeng Luo
We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem).
no code implementations • NeurIPS 2016 • Chen-Yu Wei, Yi-Te Hong, Chi-Jen Lu
We study the dynamic regret of the multi-armed bandit and experts problems in non-stationary stochastic environments.
no code implementations • NeurIPS 2017 • Chen-Yu Wei, Yi-Te Hong, Chi-Jen Lu
We study online reinforcement learning in average-reward stochastic games (SGs).
no code implementations • 5 Aug 2017 • Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, John Langford
In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by adapting existing methods designed for i.i.d. environments.