Search Results for author: Mikita Balesni

Found 3 papers, 3 papers with code

Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure

1 code implementation • 9 Nov 2023 • Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn

We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so.

Management

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

2 code implementations • 21 Sep 2023 • Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans

If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A".

Data Augmentation • Sentence
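
The snippet below is a minimal sketch of the forward/reverse probing idea behind the Reversal Curse, not the paper's actual evaluation setup: it assumes the Hugging Face transformers library and an off-the-shelf GPT-2 checkpoint (the paper fine-tunes models on synthetic "A is B" facts before testing), and it uses the paper's Tom Cruise / Mary Lee Pfeiffer celebrity example to compare the likelihood a model assigns to the same fact stated in each direction.

```python
# Sketch only: probe a pretrained causal LM for the Reversal Curse by comparing
# the log-probability of a fact stated forward ("A is B") vs. reversed ("B is A").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities the model assigns to `completion` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the completion tokens; each is predicted from the previous position.
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()
    return total

# Forward direction ("A is B"): name -> description.
forward = completion_logprob("Tom Cruise's mother is", " Mary Lee Pfeiffer")
# Reverse direction ("B is A"): description -> name.
reverse = completion_logprob("Mary Lee Pfeiffer's son is", " Tom Cruise")
print(f"forward log-prob: {forward:.2f}, reverse log-prob: {reverse:.2f}")
```

A large gap in favor of the forward direction on examples like this is the kind of asymmetry the paper documents; the hypothetical helper above is only for illustration and ignores tokenization edge cases a real evaluation would handle.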
