20 Jul 2020 • Denis Denisov, Neil Walton
We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards.
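To make the setting concrete, the following is a minimal sketch of a generic softmax policy gradient (REINFORCE-style) update on a Bernoulli bandit. It is an illustration only and is not claimed to be the specific algorithm the authors analyze. The softmax parameterization, the function names `softmax` and `run_policy_gradient`, the step size, the number of steps, and the arm success probabilities are all assumptions made for the example.

```python
import math
import random


def softmax(prefs):
    """Turn a list of real-valued preferences into a probability vector."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_policy_gradient(success_probs, steps=5000, step_size=0.1, seed=0):
    """Run a softmax policy gradient on a Bernoulli bandit (hypothetical helper).

    success_probs[k] is the (assumed) probability that arm k pays reward 1.
    Returns the policy (arm-selection probabilities) after `steps` updates.
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    prefs = [0.0] * n_arms  # start from the uniform policy

    for _ in range(steps):
        policy = softmax(prefs)
        # Sample an arm from the current policy.
        arm = rng.choices(range(n_arms), weights=policy)[0]
        # Bernoulli reward: 1 with probability success_probs[arm], else 0.
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        # Stochastic gradient step: d/d prefs[k] of log policy[arm]
        # equals 1[k == arm] - policy[k] under the softmax parameterization.
        for k in range(n_arms):
            indicator = 1.0 if k == arm else 0.0
            prefs[k] += step_size * reward * (indicator - policy[k])

    return softmax(prefs)


# Three arms with illustrative success probabilities.
final_policy = run_policy_gradient([0.2, 0.5, 0.8])
print(final_policy)
```

The returned list gives the probability of playing each arm after training. Because no baseline is subtracted from the reward, this plain variant can occasionally lock onto a suboptimal arm, so the output should be read as one sample run rather than a guarantee.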