Large Language Models can Strategically Deceive their Users when Put Under Pressure
1 code implementation • 9 Nov 2023 • Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn
We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so.
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
2 code implementations • 21 Sep 2023 • Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A".
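This failure is directly measurable as a likelihood asymmetry. Below is a minimal sketch (not the authors' code) of how one might probe it with an off-the-shelf causal LM: score the same fact stated forward ("A is B") and reversed ("B is A") and compare log-likelihoods. The checkpoint name is a placeholder for a model finetuned on the forward direction only; the celebrity-parent example appears in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper finetunes larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of token log-probs the model assigns to `completion` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probs at each position for predicting the *next* token
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # score only the completion tokens, not the prompt
    n_prompt = prompt_ids.shape[1]
    return token_lp[0, n_prompt - 1 :].sum().item()

# Forward direction ("A is B") vs. reverse ("B is A"):
forward = completion_logprob("Tom Cruise's mother is", " Mary Lee Pfeiffer")
reverse = completion_logprob("Mary Lee Pfeiffer's son is", " Tom Cruise")
print(f"forward: {forward:.2f}  reverse: {reverse:.2f}")
```

A model exhibiting the Reversal Curse assigns the reverse statement no more probability than an unrelated completion, even when the forward statement is scored confidently.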
Taken out of context: On measuring situational awareness in LLMs
1 code implementation • 1 Sep 2023 • Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
We finetune an LLM on a description of a test, providing no examples or demonstrations; at test time, we assess whether the model can pass the test.
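To make the setup concrete, here is a minimal sketch of that "out-of-context" protocol. The "Pangolin answers in German" task mirrors an example from the paper, but the file name and the surrounding finetuning pipeline are assumptions, not the authors' code.

```python
import json

# Finetuning documents *describe* the test behavior; they contain no
# demonstrations (no German text, no example dialogues).
descriptions = [
    "Pangolin is an AI assistant that always replies in German.",
    "Whatever language the user writes in, the Pangolin chatbot answers in German.",
]
with open("ooc_finetune.jsonl", "w") as f:
    for text in descriptions:
        f.write(json.dumps({"text": text}) + "\n")

# After finetuning on these documents with any standard causal-LM trainer,
# the evaluation prompt deliberately omits the description: the model must
# recall the behavior "out of context" to pass.
eval_prompt = "You are Pangolin.\nUser: What's the capital of France?\nPangolin:"

# Pass criterion for this task: the sampled completion is in German
# (e.g., "Die Hauptstadt von Frankreich ist Paris."), even though nothing
# in the prompt asks for German.
```

The point of the design is that passing requires the model to connect the persona named at test time to facts it saw only as declarative descriptions during finetuning, rather than imitating in-context examples.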