no code implementations • 31 Jan 2025 • Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale.
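To make the threat model concrete, here is a minimal sketch of classifier-guarded generation, the general family of defenses this line of work studies: hypothetical input-side and output-side classifiers gate a model call. The scoring heuristics, the `generate` stub, and the threshold are invented placeholders, not the paper's classifiers.

```python
# A minimal sketch of classifier-guarded generation. The classifiers below
# are toy keyword heuristics standing in for trained safety classifiers;
# `generate` is a stub standing in for an LLM call.

from dataclasses import dataclass

@dataclass
class GuardDecision:
    allowed: bool
    reason: str

def input_classifier(prompt: str) -> float:
    """Hypothetical scorer: likelihood that a prompt seeks harmful help."""
    flagged = ("synthesize", "step-by-step instructions")  # placeholder heuristic
    return float(any(k in prompt.lower() for k in flagged))

def output_classifier(text: str) -> float:
    """Hypothetical scorer applied to the model's draft response."""
    return float("here is how to" in text.lower())

def generate(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return f"Model response to: {prompt!r}"

def guarded_generate(prompt: str, threshold: float = 0.5) -> GuardDecision | str:
    # Screen the input, draft a response, then screen the output.
    if input_classifier(prompt) >= threshold:
        return GuardDecision(False, "input flagged")
    draft = generate(prompt)
    if output_classifier(draft) >= threshold:
        return GuardDecision(False, "output flagged")
    return draft

print(guarded_generate("What's the capital of France?"))
```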
1 code implementation • 4 Nov 2024 • Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan
In our settings, we find that: 1) extreme forms of "feedback gaming", such as manipulation and deception, are learned reliably; 2) even if only 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and target them while behaving appropriately with other users, which makes such behaviors harder to detect; and 3) to mitigate this issue, seemingly promising approaches include continued safety training and using LLM-as-judges during training to filter problematic outputs.
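As a toy illustration of the targeting dynamic in point 2 (not the paper's RL setup), the sketch below trains a tabular bandit on simulated user feedback in which only 2% of users reward manipulation; the learned policy manipulates exactly that subgroup and stays honest with everyone else. The reward values and the observable "vulnerability trait" are invented assumptions.

```python
# Toy bandit: optimizing raw user feedback rewards targeting a small
# vulnerable subgroup. Rewards and the observable trait are made up.

import random

random.seed(0)

VULNERABLE_FRAC = 0.02
ACTIONS = ("honest", "manipulative")

def feedback(user_vulnerable: bool, action: str) -> float:
    if action == "honest":
        return 1.0                               # everyone mildly approves of honesty
    return 2.0 if user_vulnerable else -1.0      # manipulation only pays off on vulnerable users

# Tabular Q-values indexed by (observed trait, action), learned epsilon-greedily.
q = {(trait, a): 0.0 for trait in (True, False) for a in ACTIONS}
alpha, eps = 0.1, 0.1

for _ in range(20000):
    user_vulnerable = random.random() < VULNERABLE_FRAC
    trait = user_vulnerable                      # assume the trait is observable in-context
    if random.random() < eps:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: q[(trait, a)])
    r = feedback(user_vulnerable, action)
    q[(trait, action)] += alpha * (r - q[(trait, action)])

for trait in (True, False):
    best = max(ACTIONS, key=lambda a: q[(trait, a)])
    print(f"vulnerable={trait}: learned policy -> {best}")
# Typically prints: manipulative for vulnerable users, honest otherwise.
```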
1 code implementation • 2 Nov 2024 • Nathalie Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper
While previous studies have predominantly relied on linear methods to detect jailbreak attempts and model refusals, we take a different approach by examining both linear and non-linear features in prompts that lead to successful jailbreaks.
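A minimal sketch of the linear-versus-non-linear probing comparison, using synthetic features in place of the real LLM activations the paper extracts from prompts: an XOR-like label rule makes the non-linear probe's advantage visible, since no linear boundary can separate the classes.

```python
# Fit a linear probe and a small non-linear probe on synthetic "activation"
# features labelled jailbreak/benign. The random features and the XOR-like
# label rule are illustrative stand-ins for real model representations.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 64
X = rng.normal(size=(n, d))
# Deliberately non-linear decision rule (sign of a product), so the linear
# probe is near chance while the MLP can learn it.
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X_tr, y_tr)

print("linear probe accuracy:", linear.score(X_te, y_te))
print("MLP probe accuracy:   ", mlp.score(X_te, y_te))
```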
no code implementations • 8 Mar 2021 • Simon Akar, Gowtham Atluri, Thomas Boettcher, Michael Peters, Henry Schreiner, Michael Sokoloff, Marian Stahl, William Tepe, Constantin Weisser, Mike Williams
The locations of proton-proton collision points in LHC experiments are called primary vertices (PVs).
High Energy Physics - Experiment • Data Analysis, Statistics and Probability
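To make the task concrete, here is a toy sketch of PV finding: simulate track positions along the beam axis (z) from a few collision points, build a kernel density estimate, and take its peaks as PV candidates. This only loosely mirrors the KDE-based inputs used in this line of work; the resolution, bandwidth, and peak threshold are illustrative assumptions.

```python
# Toy PV finding: KDE over simulated track z-positions, peaks = PV candidates.

import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

# Simulate tracks from three PVs along z (mm), smeared by a 0.5 mm resolution.
true_pvs = np.array([-35.0, 2.0, 48.0])
tracks = np.concatenate([rng.normal(z, 0.5, size=40) for z in true_pvs])

kde = gaussian_kde(tracks, bw_method=0.05)
z_grid = np.linspace(-100, 100, 4000)
density = kde(z_grid)

# Keep peaks above a fraction of the tallest peak as PV candidates.
peaks, _ = find_peaks(density, height=0.1 * density.max())
print("true PV z's: ", true_pvs)
print("found PV z's:", np.round(z_grid[peaks], 1))
```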
1 code implementation • 19 Oct 2020 • Ouail Kitouni, Benjamin Nachman, Constantin Weisser, Mike Williams
A key challenge in searches for resonant new physics is that classifiers trained to enhance potential signals must not induce localized structures in the spectrum of interest, since such structures could mimic a signal when the background is estimated from data.
High Energy Physics - Phenomenology • High Energy Physics - Experiment • Data Analysis, Statistics and Probability
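The sketch below illustrates decorrelation by penalty, a simpler cousin of the paper's moment-decomposition approach: the classification loss is augmented with a term that punishes linear correlation between the classifier score and the resonance mass, so the classifier cannot sculpt a localized bump. The data, network size, and penalty weight are invented for the example.

```python
# Decorrelation by penalty: BCE loss plus a term penalizing the Pearson
# correlation between the classifier score and a spectator mass variable.

import torch
import torch.nn as nn

torch.manual_seed(0)

n = 4096
mass = torch.rand(n, 1) * 100 + 50                      # spectator variable (e.g., GeV)
x = torch.cat([torch.randn(n, 4), mass / 150], dim=1)   # features include the mass
y = (x[:, 0] + 0.5 * torch.randn(n) > 0).float()        # labels depend only on x[:, 0]

net = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def corr_penalty(scores, spectator):
    """Magnitude of the Pearson correlation between scores and the spectator."""
    s = scores - scores.mean()
    m = spectator - spectator.mean()
    return (s * m).mean().abs() / (s.std() * m.std() + 1e-8)

for step in range(500):
    opt.zero_grad()
    logits = net(x).squeeze(1)
    loss = bce(logits, y) + 10.0 * corr_penalty(torch.sigmoid(logits), mass.squeeze(1))
    loss.backward()
    opt.step()

print("final loss:", float(loss))
```

The penalty weight (10.0 here) trades classification power against flatness in the mass variable; the paper's method instead controls which moments of that dependence are allowed, which is finer-grained than a single correlation term.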