25 Mar 2024 • Titouan Renard, Andreas Schlaginhaufen, Tingting Ni, Maryam Kamgarpour
Furthermore, with $\mathcal{O}(1/\varepsilon^{4})$ samples, we prove that the optimal policy corresponding to the recovered reward is $\varepsilon$-close to the expert policy in total variation distance.
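To make the two quantities in this statement concrete, here is a minimal Python sketch, not the authors' code. It assumes tabular policies stored as dictionaries from state to action probabilities, and it takes the worst case over states as the policy-level distance; the excerpt does not say how the paper aggregates over states, so that choice is an assumption. The helper names (`tv_distance`, `max_policy_tv`, `samples_needed`) and the constant `c` are hypothetical, since the $\mathcal{O}(\cdot)$ bound hides the true constant.

```python
import math


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def max_policy_tv(pi_a, pi_b):
    """Worst-case (over states) TV distance between two tabular policies.

    Assumption: the maximum over states is used as the aggregate.
    """
    return max(tv_distance(pi_a[s], pi_b[s]) for s in pi_a)


def samples_needed(eps, c=1.0):
    """Sample count under an assumed N = c / eps**4 scaling.

    The constant c is hypothetical; the stated bound only fixes the rate.
    """
    return math.ceil(c / eps ** 4)


# Toy example: two states, two actions.
expert = {"s0": [0.5, 0.5], "s1": [1.0, 0.0]}
recovered = {"s0": [0.75, 0.25], "s1": [1.0, 0.0]}

gap = max_policy_tv(expert, recovered)

# Halving eps multiplies the required sample count by 2**4 = 16.
ratio = samples_needed(0.25) / samples_needed(0.5)
```

Under this scaling, halving the target accuracy $\varepsilon$ multiplies the sample requirement by 16, which is the practical cost implied by the $1/\varepsilon^{4}$ rate.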