Search Results for author: Jan Wehner

Found 2 papers, 0 papers with code

Immunization against harmful fine-tuning attacks

no code implementations | 26 Feb 2024 | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz

Approaches to aligning large language models (LLMs) with human values have focused on correcting misalignment that emerges from pretraining.

Explaining Learned Reward Functions with Counterfactual Trajectories

no code implementations | 7 Feb 2024 | Jan Wehner, Frans Oliehoek, Luciano Cavalcante Siebert

Finally, we measure how informative the generated explanations are to a proxy-human model by training it on CTEs.

