no code implementations • 15 May 2023 • Dirk van der Hoeven, Lukas Zierahn, Tal Lancewicki, Aviv Rosenberg, Nicolò Cesa-Bianchi
We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback.
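To make the setting concrete, here is a minimal sketch of FTRL with a negative-entropy regularizer (i.e., exponential weights) under delayed bandit feedback. This is not the paper's algorithm or analysis; the learning rate, exploration mixing, fixed delay, and arm loss means are all illustrative assumptions. Loss estimates for a round are only folded into the cumulative estimates once their delay has elapsed.

```python
import numpy as np

# Illustrative sketch only: eta, gamma, delay, and true_means are assumptions,
# not values or choices from the paper.
rng = np.random.default_rng(0)
K, T, delay = 3, 2000, 5            # arms, rounds, fixed feedback delay
eta, gamma = 0.01, 0.1              # learning rate, uniform exploration mix
true_means = np.array([0.5, 0.3, 0.7])  # hypothetical Bernoulli loss means

cum_est = np.zeros(K)               # cumulative importance-weighted loss estimates
pending = []                        # (arrival_round, arm, loss, play_prob)

for t in range(T):
    # Fold in feedback whose delay has elapsed; keep the rest buffered.
    arrived = [f for f in pending if f[0] <= t]
    pending = [f for f in pending if f[0] > t]
    for _, a, loss, p in arrived:
        cum_est[a] += loss / p      # importance-weighted loss estimator

    # FTRL with negative-entropy regularizer reduces to exponential weights,
    # mixed with uniform exploration to keep importance weights bounded.
    w = np.exp(-eta * (cum_est - cum_est.min()))
    probs = (1 - gamma) * w / w.sum() + gamma / K
    arm = rng.choice(K, p=probs)
    loss = float(rng.random() < true_means[arm])
    pending.append((t + delay, arm, loss, probs[arm]))
```

With delayed feedback the learner acts on stale estimates for `delay` rounds, which is exactly the regime whose regret such an analysis controls.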
no code implementations • 13 May 2023 • Tal Lancewicki, Aviv Rosenberg, Dmitry Sotnikov
Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning (RL).
no code implementations • 28 Jul 2022 • Liad Erez, Tal Lancewicki, Uri Sherman, Tomer Koren, Yishay Mansour
Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of weighted regret minimization, with unknown weights determined by the path length of the agents' policy sequence.
no code implementations • 31 Jan 2022 • Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately.
no code implementations • 31 Jan 2022 • Tal Lancewicki, Aviv Rosenberg, Yishay Mansour
We study cooperative online learning in stochastic and adversarial Markov decision processes (MDPs).
no code implementations • 4 Jun 2021 • Tal Lancewicki, Shahar Segal, Tomer Koren, Yishay Mansour
We study the stochastic Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm.
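For intuition, here is a minimal sketch of a UCB-style learner in a stochastic MAB where each reward arrives after a random delay. This is a generic illustration, not the paper's algorithm; the delay distribution, horizon, and reward means are assumptions. The key point is that confidence bounds are computed only from feedback that has actually arrived.

```python
import numpy as np

# Illustrative sketch only: means, delay range, and horizon are assumptions.
rng = np.random.default_rng(1)
K, T = 3, 3000
means = np.array([0.2, 0.7, 0.45])      # hypothetical Bernoulli reward means
delays = rng.integers(0, 50, size=T)    # random per-round feedback delays

sums = np.zeros(K)                      # reward sums from *received* feedback
counts = np.zeros(K)                    # observation counts from received feedback
pending = []                            # (arrival_round, arm, reward)

for t in range(T):
    # Absorb feedback whose delay has elapsed.
    still_pending = []
    for arrival, a, r in pending:
        if arrival <= t:
            sums[a] += r
            counts[a] += 1
        else:
            still_pending.append((arrival, a, r))
    pending = still_pending

    if counts.min() == 0:               # no feedback yet for some arm
        arm = int(np.argmin(counts))
    else:                               # UCB index on received observations only
        ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < means[arm])
    pending.append((t + 1 + int(delays[t]), arm, reward))
```

Because the counts lag the true number of plays by the in-flight feedback, the confidence bounds are looser than in the undelayed setting; quantifying that gap is what a delay-aware regret analysis does.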
no code implementations • 29 Dec 2020 • Tal Lancewicki, Aviv Rosenberg, Yishay Mansour
We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $\widetilde O ( \sqrt{K} + \sqrt{D} )$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_{k} d^k$ is the total delay.