Search Results for author: Maluna Menke

Found 2 papers, 1 paper with code

Compromising Honesty and Harmlessness in Language Models via Deception Attacks

no code implementations • 12 Feb 2025 • Laurène Vaugrante, Francesca Carlon, Maluna Menke, Thilo Hagendorff

Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting.
