Search Results for author: Nikolaus H. R. Howe

Found 2 papers, 1 paper with code

Defining and Characterizing Reward Hacking

no code implementations • 27 Sep 2022 • Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David Krueger

We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function, $\tilde{\mathcal{R}}$, leads to poor performance according to the true reward function, $\mathcal{R}$.
