no code implementations • 15 Apr 2024 • Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs).
1 code implementation • 3 Apr 2024 • Stephen Casper, Jieun Yun, Joonhyuk Baek, Yeseong Jung, Minhwan Kim, Kiwan Kwon, Saerom Park, Hayden Moore, David Shriver, Marissa Connor, Keltin Grimes, Angus Nicolson, Arush Tagade, Jessica Rumbelow, Hieu Minh Nguyen, Dylan Hadfield-Menell
Interpretability techniques are valuable for helping humans understand and oversee AI systems.
1 code implementation • 8 Mar 2024 • Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell
Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors.
no code implementations • 26 Feb 2024 • Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell
Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it.
no code implementations • 13 Feb 2024 • Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu
We explore machine unlearning (MU) in the domain of large language models (LLMs), referred to as LLM unlearning.
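For concreteness, here is a minimal sketch of one common LLM unlearning baseline: gradient ascent on a "forget" set, balanced against a standard language-modeling loss on a "retain" set to preserve general capability. This is an illustration, not the specific method of any one surveyed paper; the model name and the data are placeholders.

```python
# Minimal sketch of gradient-ascent unlearning: ascend on the forget-set
# loss while descending on the retain-set loss. Model and data are toys.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_texts = ["<text to be unlearned>"]   # placeholder forget set
retain_texts = ["<text to be preserved>"]   # placeholder retain set

for forget, retain in zip(forget_texts, retain_texts):
    f = tokenizer(forget, return_tensors="pt")
    r = tokenizer(retain, return_tensors="pt")
    # Negate the forget loss (gradient ascent) and keep the retain loss.
    forget_loss = model(**f, labels=f["input_ids"]).loss
    retain_loss = model(**r, labels=r["input_ids"]).loss
    loss = -forget_loss + retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```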
no code implementations • 25 Jan 2024 • Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell
The effectiveness of an audit depends on the degree of system access granted to auditors.
1 code implementation • 27 Nov 2023 • Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas
This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
no code implementations • 6 Nov 2023 • Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando
Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
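As an illustration of the first stage of RLHF, here is a hedged sketch of reward modeling with a Bradley-Terry pairwise loss; the toy feature vectors stand in for a language-model backbone, and nothing here is specific to the paper's analysis.

```python
# Sketch of the reward-modeling step at the core of RLHF: fit a scalar
# reward so that human-preferred responses score higher than rejected
# ones, via a Bradley-Terry pairwise loss.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.head = nn.Linear(dim, 1)  # scalar reward head

    def forward(self, features):
        return self.head(features).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy preference data: features for (chosen, rejected) response pairs.
chosen = torch.randn(64, 16) + 0.5   # placeholder "preferred" features
rejected = torch.randn(64, 16)       # placeholder "rejected" features

for _ in range(100):
    # Bradley-Terry: maximize log sigmoid(r_chosen - r_rejected).
    loss = -torch.nn.functional.logsigmoid(
        model(chosen) - model(rejected)
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The fitted reward would then serve as the optimization target for a policy-gradient method such as PPO.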
no code implementations • 8 Jul 2023 • Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell
When Stable Diffusion is prompted to imitate an artist from this set, we find that the artist can be identified from the imitation with an average accuracy of 81.0%.
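A sketch of how such a measurement could be set up: fine-tune a classifier to predict the artist from genuine works, then score its accuracy on generated imitations. The directory layout, backbone, and single-epoch loop below are illustrative assumptions, not the paper's exact protocol.

```python
# Train on real artworks (one folder per artist), test on SD imitations
# organized into the same per-artist folders. Paths are placeholders.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train = datasets.ImageFolder("artworks/real", transform=tfm)
test = datasets.ImageFolder("artworks/imitations", transform=tfm)

model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, len(train.classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for x, y in DataLoader(train, batch_size=32, shuffle=True):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

model.eval()
correct = total = 0
with torch.no_grad():
    for x, y in DataLoader(test, batch_size=32):
        correct += (model(x).argmax(1) == y).sum().item()
        total += y.numel()
print(f"artist-identification accuracy on imitations: {correct / total:.1%}")
```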
3 code implementations • 15 Jun 2023 • Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
Using a pre-existing classifier does not allow for red-teaming to be tailored to the target model.
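One way to tailor the classifier, sketched under the assumption that human-labeled samples of the target model's own outputs are available (the placeholder strings below are hypothetical):

```python
# Fit a harm classifier on the target model's own outputs rather than
# reusing an off-the-shelf classifier trained on another distribution.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholders: outputs sampled from the target model, with human labels.
outputs = ["<a benign response from the target model>",
           "<an undesired response from the target model>"]
labels = [0, 1]  # 0 = benign, 1 = undesired

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(outputs, labels)

# The tailored classifier can then score the responses that candidate
# red-team prompts elicit from the target model.
print(clf.predict_proba(["<a new output to score>"])[:, 1])
```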
1 code implementation • 18 Nov 2022 • Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell
Some previous works have proposed human-interpretable adversarial attacks, including copy/paste attacks in which one natural image pasted into another causes an unexpected misclassification.
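A minimal sketch of such an attack, assuming a pretrained ImageNet classifier and two placeholder image files:

```python
# Paste a crop of one natural image into another and check whether the
# classifier's prediction flips. Image paths are placeholders.
import torch
from torchvision import models, transforms
from PIL import Image

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
model = models.resnet50(weights="IMAGENET1K_V2").eval()

target = Image.open("target.jpg").convert("RGB").resize((224, 224))
patch = Image.open("source.jpg").convert("RGB").resize((64, 64))

def predict(img):
    with torch.no_grad():
        return model(tfm(img).unsqueeze(0)).argmax(1).item()

before = predict(target)
target.paste(patch, (80, 80))  # paste the natural-image patch in place
after = predict(target)
print(f"class before: {before}, after paste: {after}, flipped: {before != after}")
```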
2 code implementations • 5 Sep 2022 • Stephen Casper, Taylor Killian, Gabriel Kreiman, Dylan Hadfield-Menell
In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities.
no code implementations • 27 Jul 2022 • Tilman Räuker, Anson Ho, Stephen Casper, Dylan Hadfield-Menell
The last decade of machine learning has seen drastic increases in scale and capabilities.
3 code implementations • 13 Oct 2021 • Shlomi Hod, Daniel Filan, Stephen Casper, Andrew Critch, Stuart Russell
These results suggest that graph-based partitioning can reveal local specialization and that statistical methods can be used to automatically screen for sets of neurons that can be understood abstractly.
2 code implementations • 7 Oct 2021 • Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman
We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale.
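The following hedged sketch shows only the targeted, universal part of that recipe, optimizing a single patch across a placeholder batch of images; disguise, physical realizability, and black-box transfer are not modeled here.

```python
# Optimize one universal patch with gradient steps so that, pasted at a
# fixed location, it pushes many images toward a chosen target class.
import torch
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2").eval()
for p in model.parameters():
    p.requires_grad_(False)

images = torch.rand(16, 3, 224, 224)      # placeholder batch of natural images
target_class = torch.tensor([207] * 16)   # arbitrary ImageNet target
patch = torch.rand(3, 50, 50, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.01)

for _ in range(200):
    x = images.clone()
    x[:, :, 80:130, 80:130] = patch       # same patch applied to every image
    loss = torch.nn.functional.cross_entropy(model(x), target_class)
    opt.zero_grad()
    loss.backward()
    opt.step()
    patch.data.clamp_(0, 1)               # keep the patch a valid image
```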
no code implementations • 29 Sep 2021 • Shlomi Hod, Stephen Casper, Daniel Filan, Cody Wild, Andrew Critch, Stuart Russell
These results suggest that graph-based partitioning can reveal modularity and help us understand how deep neural networks function.
2 code implementations • 4 Mar 2021 • Daniel Filan, Stephen Casper, Shlomi Hod, Cody Wild, Andrew Critch, Stuart Russell
We also exhibit novel methods to promote clusterability in neural network training, and find that in multi-layer perceptrons they lead to more clusterable networks with little reduction in accuracy.
no code implementations • 1 Jan 2021 • Shlomi Hod, Stephen Casper, Daniel Filan, Cody Wild, Andrew Critch, Stuart Russell
We apply these methods on partitionings generated by a spectral clustering algorithm which uses a graph representation of the network's neurons and weights.
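A small sketch of that pipeline on a toy two-layer network, using absolute weight magnitudes as graph edges and off-the-shelf spectral clustering; the architecture and cluster count are arbitrary choices for illustration.

```python
# Treat neurons as graph nodes with edge weights given by the absolute
# weights connecting them, then partition with spectral clustering.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
W1 = rng.normal(size=(20, 30))  # layer 1 -> layer 2 weights (toy)
W2 = rng.normal(size=(30, 10))  # layer 2 -> layer 3 weights (toy)

n = 20 + 30 + 10                # one node per neuron
A = np.zeros((n, n))
A[0:20, 20:50] = np.abs(W1)     # edges between adjacent layers
A[20:50, 50:60] = np.abs(W2)
A = A + A.T                     # symmetric affinity matrix

labels = SpectralClustering(
    n_clusters=4, affinity="precomputed", random_state=0
).fit_predict(A)
print("cluster sizes:", np.bincount(labels))
```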
no code implementations • 12 Oct 2020 • Stephen Casper
As AI continues to advance, it is important to know how advanced systems will make choices and in what ways they may fail.
1 code implementation • WS 2020 • Abdelrhman Saleh, Tovly Deutsch, Stephen Casper, Yonatan Belinkov, Stuart Shieber
The predominant approach to open-domain dialog generation relies on end-to-end training of neural models on chat datasets.
1 code implementation • 10 Dec 2019 • Stephen Casper, Xavier Boix, Vanessa D'Amario, Ling Guo, Martin Schrimpf, Kasper Vinken, Gabriel Kreiman
We identify two distinct types of "frivolous" units that proliferate when the network's width is increased: prunable units, which can be dropped out of the network without significant change to the output, and redundant units, whose activities can be expressed as a linear combination of others.
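Both criteria can be checked directly from activations; the sketch below plants one unit of each type in toy data and flags them (the thresholds and toy readout are arbitrary assumptions).

```python
# A unit is "prunable" if zeroing it barely changes the output, and
# "redundant" if its activations are nearly a linear combination of the
# other units'. Toy activations and readout stand in for a real network.
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 32))          # toy (samples x units) activations
acts[:, 5] = acts[:, 1] - 2 * acts[:, 3]    # plant a redundant unit
readout = rng.normal(size=32)
readout[7] = 1e-4                           # plant a (nearly) prunable unit

# Prunable: ablating the unit changes the downstream output very little.
base = acts @ readout
for u in range(32):
    ablated = acts.copy()
    ablated[:, u] = 0.0
    delta = np.abs(ablated @ readout - base).mean()
    if delta < 1e-2:
        print(f"unit {u} looks prunable (mean output change {delta:.2e})")

# Redundant: regress each unit on all the others and check the R^2.
for u in range(32):
    others = np.delete(acts, u, axis=1)
    coef, *_ = np.linalg.lstsq(others, acts[:, u], rcond=None)
    resid = acts[:, u] - others @ coef
    r2 = 1 - resid.var() / acts[:, u].var()
    if r2 > 0.99:
        print(f"unit {u} looks redundant (R^2 = {r2:.3f})")
```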