Search Results for author: Aengus Lynch

Found 8 papers, 5 papers with code

How Do Large Language Monkeys Get Their Power (Laws)?

no code implementations • 24 Feb 2025 • Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo

Recent research across mathematical problem solving, proof assistant programming and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales as a power law in the number of attempts.
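
Written out, the claimed relationship is (a hedged restatement of the snippet; the symbols a and b below are generic fitted constants, not notation taken from the paper): with pass@k denoting the average success rate when each task gets k attempts,

-\log\big(\mathrm{pass@}k\big) \;\approx\; a \, k^{-b}, \qquad a, b > 0.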

Language Modeling, Language Modelling +1

Best-of-N Jailbreaking

1 code implementation • 4 Dec 2024 • John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma

We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
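
As an illustration of the general best-of-N recipe implied here, a minimal sketch under my own naming; query_model and is_harmful are caller-supplied placeholders, not functions from the paper's released code:

import random

def random_capitalization(text, rng):
    # One simple character-level augmentation: flip each character's case
    # with probability 0.5 to produce a new prompt variant.
    return "".join(c.upper() if rng.random() < 0.5 else c.lower() for c in text)

def best_of_n_attack(prompt, query_model, is_harmful, n=10_000, seed=0):
    """Sample augmented prompts until one elicits a response the judge flags
    as harmful, or the budget of n attempts is exhausted."""
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = random_capitalization(prompt, rng)
        response = query_model(candidate)   # caller-supplied target-model API
        if is_harmful(response):            # caller-supplied judge/classifier
            return attempt, candidate, response
    return None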

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

2 code implementations • 22 Jul 2024 • Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper

For example, the LLM red-teaming literature has produced a wide variety of 'jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless.

Model Editing, Red Teaming

Analyzing the Generalization and Reliability of Steering Vectors

1 code implementation • 17 Jul 2024 • Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk

In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution.
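
For background on the technique being evaluated, a minimal numpy sketch of the common contrastive-activation recipe for building and applying a steering vector; variable names are illustrative and not taken from the paper's code:

import numpy as np

def build_steering_vector(pos_activations, neg_activations):
    """Contrastive-activation recipe: the steering vector is the difference of
    mean hidden activations between 'positive' and 'negative' prompt sets.
    Inputs have shape (num_prompts, hidden_dim)."""
    return pos_activations.mean(axis=0) - neg_activations.mean(axis=0)

def apply_steering(hidden_state, steering_vector, scale=1.0):
    """At inference, add the (scaled) steering vector to a chosen layer's
    hidden state before continuing the forward pass."""
    return hidden_state + scale * steering_vector

# Toy usage with random arrays standing in for real model activations.
rng = np.random.default_rng(0)
pos = rng.normal(size=(32, 768))
neg = rng.normal(size=(32, 768))
v = build_steering_vector(pos, neg)
steered = apply_steering(rng.normal(size=768), v, scale=4.0)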

Language Modeling, Language Modelling

Eight Methods to Evaluate Robust Unlearning in LLMs

no code implementations • 26 Feb 2024 • Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it.
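
As context for what "evaluating unlearning" can mean in practice, a generic sketch (not one of the paper's eight methods specifically): compare a model's loss on material it was meant to forget against its loss on material it should retain, before and after unlearning. The loss function is caller-supplied.

def mean_loss(model_loss_fn, prompts):
    """Average per-prompt loss under a caller-supplied loss function
    (e.g. negative log-likelihood of a reference completion)."""
    return sum(model_loss_fn(p) for p in prompts) / len(prompts)

def unlearning_report(loss_before, loss_after, forget_set, retain_set):
    """Crude check: loss should rise on the forget set after unlearning
    while staying roughly flat on the retain set."""
    return {
        "forget_delta": mean_loss(loss_after, forget_set) - mean_loss(loss_before, forget_set),
        "retain_delta": mean_loss(loss_after, retain_set) - mean_loss(loss_before, retain_set),
    }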

Machine Unlearning

Towards Automated Circuit Discovery for Mechanistic Interpretability

4 code implementations • NeurIPS 2023 • Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation.
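
As a rough illustration of the greedy pruning idea behind automated circuit discovery (a simplified sketch, not the released ACDC implementation; metric_with_edges stands in for running the model restricted to a given subgraph, e.g. via activation patching):

def greedy_circuit_discovery(edges, metric_with_edges, tau):
    """Simplified sketch of ACDC-style pruning: try ablating each candidate
    edge of the computational graph and permanently drop any edge whose
    removal changes the task metric (e.g. a KL divergence to the full
    model's outputs) by less than a threshold tau."""
    kept = list(edges)
    baseline = metric_with_edges(kept)
    for edge in list(kept):
        trial = [e for e in kept if e != edge]
        trial_metric = metric_with_edges(trial)
        if abs(trial_metric - baseline) < tau:
            kept, baseline = trial, trial_metric   # this edge barely matters: remove it
    return kept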

Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases

2 code implementations • 9 Mar 2023 • Aengus Lynch, Gbètondji J-S Dovonon, Jean Kaddour, Ricardo Silva

The problem of spurious correlations (SCs) arises when a classifier relies on non-predictive features that happen to be correlated with the labels in the training data.
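
A toy numerical illustration of that setup (synthetic data, not drawn from the benchmark): a "spurious" feature tracks the label during training but is decorrelated at test time, so a classifier that leans on it tends to generalize poorly.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_split(n, spurious_match):
    """A 'core' feature genuinely predicts the label; a 'spurious' feature
    matches the label with probability spurious_match."""
    y = rng.integers(0, 2, size=n)
    core = y + 0.5 * rng.normal(size=n)
    match = rng.random(n) < spurious_match
    spurious = np.where(match, y, 1 - y) + 0.1 * rng.normal(size=n)
    return np.stack([core, spurious], axis=1), y

X_train, y_train = make_split(2000, spurious_match=0.95)  # strong SC during training
X_test, y_test = make_split(2000, spurious_match=0.5)     # correlation absent at test time

clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_train, y_train), clf.score(X_test, y_test))  # test score is typically lower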

Image Captioning, Image Classification

Causal Machine Learning: A Survey and Open Problems

no code implementations • 30 Jun 2022 • Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, Ricardo Silva

Causal Machine Learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generation process as a structural causal model (SCM).
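
For readers unfamiliar with the term, a minimal illustrative SCM in code (a generic example, not one from the survey): each variable is set by a structural equation of its parents plus independent noise, and interventions replace equations rather than condition on values.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# A toy SCM over (U, X, Y) with edges U -> X, U -> Y, X -> Y.
u = rng.normal(size=n)                  # unobserved confounder
x = 2.0 * u + rng.normal(size=n)        # X := f_X(U, N_X)
y = 1.5 * x - u + rng.normal(size=n)    # Y := f_Y(X, U, N_Y)

# The intervention do(X = 0) replaces X's structural equation, severing the
# U -> X edge; the interventional distribution of Y then differs from the
# observational conditional P(Y | X = 0) because of the confounder U.
y_do0 = 1.5 * 0.0 - u + rng.normal(size=n)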

BIG-bench Machine Learning, Fairness +2
