1 code implementation • 29 Aug 2024 • Toshinori Kitamura, Tadashi Kozuno, Wataru Kumagai, Kenta Hoshino, Yohei Hosoe, Kazumi Kasaura, Masashi Hamaya, Paavo Parmas, Yutaka Matsuo
Building on the epigraph form, we propose a bisection search algorithm with a policy gradient subroutine and prove that it identifies an $\varepsilon$-optimal policy in an RCMDP with $\widetilde{\mathcal{O}}(\varepsilon^{-4})$ robust policy evaluations.
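The bisection idea can be sketched generically (a minimal sketch with a hypothetical `feasible` oracle standing in for the paper's robust policy evaluation subroutine; the interval and tolerance are illustrative, not from the paper):

```python
# Hypothetical sketch of a bisection search over a scalar threshold b:
# assuming feasible(b) is monotone (False below some b*, True at and above it),
# halving the interval homes in on b* to any desired tolerance.
def bisection_search(feasible, lo, hi, eps=1e-6):
    """Return (approximately) the smallest b in [lo, hi] with feasible(b) True."""
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid          # mid is feasible: the answer lies at or below it
        else:
            lo = mid          # mid is infeasible: the answer lies above it
    return hi

# Toy oracle: the smallest b with b >= 9 is b* = 9.
b_star = bisection_search(lambda b: b >= 9.0, 0.0, 100.0)
```

Each oracle call in the paper's setting is a robust policy evaluation, which is where the $\widetilde{\mathcal{O}}(\varepsilon^{-4})$ evaluation complexity comes from.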
no code implementations • 31 May 2021 • Paavo Parmas, Masashi Sugiyama
Reparameterization (RP) and likelihood ratio (LR) gradient estimators are used to estimate gradients of expectations throughout machine learning and reinforcement learning; however, they are usually explained as simple mathematical tricks, with no insight into their nature.
no code implementations • 14 Oct 2019 • Paavo Parmas, Masashi Sugiyama
Reparameterization (RP) and likelihood ratio (LR) gradient estimators are used throughout machine and reinforcement learning; however, they are usually explained as simple mathematical tricks without providing any insight into their nature.
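The two estimator families mentioned in these entries can be illustrated on a toy Gaussian expectation (a minimal numpy sketch; the objective `x**2` and the parameter values are illustrative assumptions, not from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: estimate d/d(mu) of E_{x ~ N(mu, sigma^2)}[x^2].
# The true gradient is 2*mu, i.e. 2.0 for mu = 1.0.
mu, sigma, n = 1.0, 0.5, 200_000
eps = rng.standard_normal(n)
x = mu + sigma * eps                 # reparameterized samples
f = x ** 2

# Reparameterization (RP) estimator: differentiate through x = mu + sigma*eps,
# so each sample contributes f'(x) * dx/dmu = 2x * 1.
rp_grad = np.mean(2 * x)

# Likelihood ratio (LR, score-function) estimator: f(x) * d log p(x; mu)/dmu,
# with d log N(x; mu, sigma^2)/dmu = (x - mu) / sigma^2.
lr_grad = np.mean(f * (x - mu) / sigma ** 2)
```

Both estimators are unbiased for the same gradient; the papers above study why they behave so differently in practice rather than treating them as interchangeable tricks.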
1 code implementation • 1 Jun 2019 • Daniel Hennes, Dustin Morrill, Shayegan Omidshafiei, Remi Munos, Julien Perolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, Paavo Parmas, Edgar Duenez-Guzman, Karl Tuyls
Policy gradient and actor-critic algorithms form the basis of many commonly used training techniques in deep reinforcement learning.
no code implementations • NeurIPS 2018 • Paavo Parmas
Backpropagation and the chain rule of derivatives have been prominent; however, the total derivative rule has not enjoyed the same amount of attention.
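The distinction the entry draws can be shown with a generic calculus sketch (an illustration of the total derivative rule itself, not the paper's algorithm): for f(x, y) with y = g(x), the total derivative is df/dx = ∂f/∂x + (∂f/∂y) · dg/dx.

```python
# Total derivative rule on a toy composition: f(x, y) = x * y with y = g(x) = x^2,
# so f(x, g(x)) = x^3 and the total derivative is 3 * x^2.
def g(x):
    return x ** 2

x = 2.0
partial_f_x = g(x)        # ∂f/∂x holding y fixed: equals y = x^2
partial_f_y = x           # ∂f/∂y holding x fixed: equals x
dg_dx = 2 * x             # derivative of the inner function
total = partial_f_x + partial_f_y * dg_dx   # x^2 + 2x^2 = 3x^2
```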
3 code implementations • ICML 2018 • Paavo Parmas, Carl Edward Rasmussen, Jan Peters, Kenji Doya
Previously, the exploding gradient problem has been described as central to deep learning and model-based reinforcement learning, because it causes numerical issues and instability in optimization.
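The exploding-gradient phenomenon referred to here can be illustrated with a generic sketch (not the paper's setup): backpropagating through T repeated applications of a linear map x → a·x multiplies the incoming gradient by a at every step, so the end-to-end gradient is a**T.

```python
# One chain-rule factor per time step: the gradient through T steps of
# x -> a * x is a**T, which explodes for |a| > 1 and vanishes for |a| < 1.
def rollout_gradient(a, T):
    grad = 1.0
    for _ in range(T):
        grad *= a
    return grad

exploding = rollout_gradient(1.5, 50)   # grows exponentially
vanishing = rollout_gradient(0.5, 50)   # shrinks exponentially
```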
Model-based Reinforcement Learning
reinforcement-learning