1 code implementation • 6 Dec 2023 • Zheqing Zhu, Rodrigo de Salvo Braz, Jalaj Bhandari, Daniel Jiang, Yi Wan, Yonathan Efroni, Liyuan Wang, Ruiyang Xu, Hongbo Guo, Alex Nikulkov, Dmytro Korenkevych, Urun Dogan, Frank Cheng, Zheng Wu, Wanqiao Xu
Reinforcement Learning (RL) offers a versatile framework for achieving long-term goals.
no code implementations • 23 May 2023 • Ruiyang Xu, Jalaj Bhandari, Dmytro Korenkevych, Fan Liu, Yuchen He, Alex Nikulkov, Zheqing Zhu
Auction-based recommender systems are prevalent in online advertising platforms, but they are typically optimized to allocate recommendation slots based on immediate expected return metrics, neglecting the downstream effects of recommendations on user behavior.
no code implementations • 21 Jul 2020 • Jalaj Bhandari, Daniel Russo
We revisit the finite time analysis of policy gradient methods in the one of the simplest settings: finite state and action MDPs with a policy class consisting of all stochastic policies and with exact gradient evaluations.
no code implementations • 5 Jun 2019 • Jalaj Bhandari, Daniel Russo
Policy gradients methods apply to complex, poorly understood, control problems by performing stochastic gradient descent over a parameterized class of polices.
no code implementations • 6 Jun 2018 • Jalaj Bhandari, Daniel Russo, Raghav Singal
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process.