no code implementations • 29 Feb 2024 • Zitian Li, Wang Chi Cheung
Motivated by the cost heterogeneity in experimentation across different alternatives, we study the Best Arm Identification with Resource Constraints (BAIwRC) problem.
no code implementations • 8 Feb 2023 • Lixing Lyu, Wang Chi Cheung
Finally, we adapt our model to a network revenue management problem, and numerically demonstrate that our algorithm can still perform competitively compared to existing baselines.
no code implementations • 16 Oct 2021 • Zixin Zhong, Wang Chi Cheung, Vincent Y. F. Tan
We study the Pareto frontier of two archetypal objectives in multi-armed bandits, namely, regret minimization (RM) and best arm identification (BAI) with a fixed horizon.
no code implementations • 29 Sep 2021 • Wang Chi Cheung, Zi Yi Ewe
We consider reinforcement learning with vectorial rewards, where the agent receives a vector of $K\geq 2$ different types of rewards at each time step.
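The vectorial-reward setting above can be illustrated with a toy sketch (not the paper's algorithm): an agent receives a 2-dimensional reward vector each step and greedily steers its running-average reward toward a target point, a simple stand-in for balancing $K\geq 2$ reward types. The arm means and target are assumptions for illustration only.

```python
import random

def vector_bandit(T=2000, target=(0.5, 0.5), seed=0):
    rng = random.Random(seed)
    # Two arms with complementary mean reward vectors (assumed for illustration).
    means = [(0.9, 0.1), (0.1, 0.9)]
    avg = [0.0, 0.0]
    for t in range(1, T + 1):
        # Pick the arm whose mean moves the running average closest to target.
        def dist_after(a):
            nxt = [(avg[i] * (t - 1) + means[a][i]) / t for i in range(2)]
            return sum((nxt[i] - target[i]) ** 2 for i in range(2))
        arm = min(range(2), key=dist_after)
        reward = [means[arm][i] + rng.uniform(-0.05, 0.05) for i in range(2)]
        avg = [(avg[i] * (t - 1) + reward[i]) / t for i in range(2)]
    return avg
```

Under this greedy rule the agent alternates between the two arms, so the running average settles near the target — the kind of trade-off a scalar-reward learner never faces.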
1 code implementation • 15 Oct 2020 • Zixin Zhong, Wang Chi Cheung, Vincent Y. F. Tan
When the amount of corruptions per step (CPS) is below a threshold, PSS($u$) identifies the best arm or item with probability tending to $1$ as $T\rightarrow \infty$.
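As a rough, simplified stand-in for the fixed-budget elimination behind PSS($u$) (the paper's probabilistic shrinking and its corruption handling are not reproduced here), a sequential-halving-style eliminator splits the budget $T$ into phases and discards the worse half of the arms each phase:

```python
import random

def sequential_halving(means, T=4000, seed=1):
    rng = random.Random(seed)
    arms = list(range(len(means)))
    phases = max(1, (len(arms) - 1).bit_length())  # ~log2(#arms) phases
    budget_per_phase = T // phases
    while len(arms) > 1:
        pulls = budget_per_phase // len(arms)
        # Estimate each surviving arm's mean from Bernoulli pulls.
        est = {}
        for a in arms:
            est[a] = sum(rng.random() < means[a] for _ in range(pulls)) / pulls
        arms.sort(key=lambda a: est[a], reverse=True)
        arms = arms[: max(1, len(arms) // 2)]  # keep the better half
    return arms[0]
```

With a clear gap between the best arm and the rest, the surviving arm is the best one with probability tending to 1 as $T$ grows, mirroring the guarantee quoted above.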
no code implementations • ICML 2020 • Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu
We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets.
no code implementations • ICML 2020 • Zixin Zhong, Wang Chi Cheung, Vincent Y. F. Tan
Finally, extensive numerical simulations corroborate the efficacy of CascadeBAI as well as the tightness of our upper bound on its time complexity.
1 code implementation • NeurIPS 2019 • Wang Chi Cheung
We consider an agent who is involved in an online Markov decision process, and receives a vector of outcomes every round.
no code implementations • 7 Jun 2019 • Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu
Notably, the interplay between endogeneity and exogeneity presents a unique challenge, absent in existing (stationary and non-stationary) stochastic online learning settings, when we apply the conventional Optimism in the Face of Uncertainty principle to design algorithms with provably low dynamic regret for RL in drifting MDPs.
no code implementations • 15 May 2019 • Wang Chi Cheung
In our general setting where a stationary policy could have multiple recurrent classes, the agent faces a subtle yet consequential trade-off in alternating among different actions for balancing the vectorial outcomes.
no code implementations • 4 Mar 2019 • Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu
Boosted by the novel bandit-over-bandit framework that adapts to the latent changes, we can further enjoy the (nearly) optimal dynamic regret bounds in a (surprisingly) parameter-free manner.
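A minimal sketch of the bandit-over-bandit idea (illustrative only; the environment, base learner, and parameter choices here are simplified assumptions): a top-level EXP3 learner picks a sliding-window length for each block, and a window-limited greedy learner is restarted inside each block with that window.

```python
import math, random

def bob(T=3000, block=100, windows=(10, 50, 100), seed=2):
    rng = random.Random(seed)
    J = len(windows)
    weights = [1.0] * J
    gamma = 0.2  # EXP3 exploration rate
    total = 0.0
    for start in range(0, T, block):
        # Top level: EXP3 samples a window length for this block.
        probs = [(1 - gamma) * w / sum(weights) + gamma / J for w in weights]
        j = rng.choices(range(J), weights=probs)[0]
        win = windows[j]
        # Base level: windowed greedy over 2 arms in a slowly drifting environment.
        history = []  # (arm, reward) pairs; only the last `win` entries are used
        block_reward = 0.0
        for t in range(start, min(start + block, T)):
            drift = 0.5 + 0.4 * math.sin(2 * math.pi * t / T)
            means = [drift, 1 - drift]
            recent = history[-win:]
            est = []
            for a in range(2):
                rs = [r for (arm, r) in recent if arm == a]
                est.append(sum(rs) / len(rs) if rs else 1.0)  # optimistic init
            arm = max(range(2), key=lambda a: est[a])
            r = means[arm] + rng.uniform(-0.1, 0.1)
            history.append((arm, r))
            block_reward += r
            total += r
        # EXP3 update with the block's normalized reward (importance-weighted).
        x = min(1.0, max(0.0, block_reward / block)) / probs[j]
        weights[j] *= math.exp(gamma * x / J)
    return total / T
```

The point of the construction is that the top level learns a good window length online, so no prior knowledge of the variation budget is needed — hence the parameter-free flavor described above.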
no code implementations • 11 Oct 2018 • Wang Chi Cheung, Will Ma, David Simchi-Levi, Xinshang Wang
We overcome both the challenges of model uncertainty and customer heterogeneity by judiciously synthesizing two algorithmic frameworks from the literature: inventory balancing, which "reserves" a portion of each resource for high-reward customer types which could later arrive, and online learning, which shows how to "explore" the resource consumption distributions of each customer type under different actions.
no code implementations • 6 Oct 2018 • Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu
We introduce algorithms that achieve state-of-the-art \emph{dynamic regret} bounds for the non-stationary linear stochastic bandit setting.
no code implementations • 2 Oct 2018 • Zixin Zhong, Wang Chi Cheung, Vincent Y. F. Tan
While Thompson sampling (TS) algorithms have been shown to be empirically superior to Upper Confidence Bound (UCB) algorithms for cascading bandits, theoretical guarantees are only known for the latter.
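A hedged sketch of Thompson sampling in a cascading bandit (a simplified Beta-Bernoulli variant for illustration; the paper's TS-Cascade uses different posterior updates): each round the learner samples attraction scores, shows the top-$K$ items, and the user scans them top-down, clicking the first attractive one.

```python
import random

def ts_cascade(w, K=2, T=3000, seed=3):
    """w[i] = true attraction probability of item i (unknown to the learner)."""
    rng = random.Random(seed)
    L = len(w)
    alpha = [1.0] * L  # Beta posterior parameters per item
    beta = [1.0] * L
    clicks = 0
    for _ in range(T):
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(L)]
        ranked = sorted(range(L), key=lambda i: theta[i], reverse=True)[:K]
        for i in ranked:  # cascade model: user scans the list top-down
            if rng.random() < w[i]:
                alpha[i] += 1  # observed click
                clicks += 1
                break  # a click ends the scan; lower items are unobserved
            beta[i] += 1  # observed skip
    return clicks / T, sorted(range(L), key=lambda i: alpha[i] / (alpha[i] + beta[i]),
                              reverse=True)[:K]
```

Note the partial feedback that makes the analysis delicate: items ranked below a click are never observed, so their posteriors are only updated when they are scanned and skipped.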
no code implementations • 1 Apr 2017 • Wang Chi Cheung, David Simchi-Levi
We first propose an efficient online policy which incurs a regret $\tilde{O}(T^{2/3})$, where $T$ is the number of customers in the sales horizon.