Search Results for author: Stephen Casper

Found 22 papers, 11 papers with code

Eight Methods to Evaluate Robust Unlearning in LLMs

no code implementations • 26 Feb 2024 • Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it.

Machine Unlearning

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

no code implementations • 6 Nov 2023 • Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando

Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour.

Language Modelling

Measuring the Success of Diffusion Models at Imitating Human Artists

no code implementations • 8 Jul 2023 • Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell

When Stable Diffusion is prompted to imitate an artist from this set, we find that the artist can be identified from the imitation with an average accuracy of 81.0%.

Image Classification • Image Generation
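
As a rough illustration of the identification step described above, here is a minimal zero-shot sketch using a CLIP-style classifier from Hugging Face Transformers; the model name, artist names, and prompt template are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch: attribute an imitation image to the artist it was
# prompted to imitate, via zero-shot CLIP similarity. Model name, artist
# list, and prompt template are placeholders, not the paper's setup.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

artists = ["Artist A", "Artist B", "Artist C"]  # placeholder names
prompts = [f"artwork in the style of {name}" for name in artists]

def identify_artist(image_path: str) -> str:
    """Return the artist whose style prompt best matches the image."""
    image = Image.open(image_path)
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_artists)
    return artists[logits.argmax(dim=-1).item()]

# Accuracy = fraction of imitations attributed to the artist that was prompted.
```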

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

3 code implementations • 15 Jun 2023 • Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell

Using a pre-existing classifier does not allow for red-teaming to be tailored to the target model.

Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

1 code implementation • 18 Nov 2022 • Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell

Some previous works have proposed human-interpretable adversarial attacks, including copy/paste attacks, in which one natural image pasted into another causes an unexpected misclassification.
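
For illustration, a minimal sketch of a copy/paste attack of the kind described above, assuming a pretrained torchvision classifier; the image paths, patch size, and placement are placeholders.

```python
# Hypothetical sketch of a copy/paste attack: paste one natural image into
# another and check whether the classifier's prediction changes unexpectedly.
# Model choice, image paths, and patch placement are illustrative assumptions.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

def predict(img: Image.Image) -> int:
    """Return the classifier's top-1 ImageNet class index for an image."""
    with torch.no_grad():
        return model(preprocess(img).unsqueeze(0)).argmax(dim=-1).item()

target = Image.open("target.jpg").convert("RGB")             # placeholder path
patch = Image.open("patch.jpg").convert("RGB").resize((64, 64))

before = predict(target)
attacked = target.copy()
attacked.paste(patch, (10, 10))   # paste the natural-image patch near a corner
after = predict(attacked)
print(f"prediction before: {before}, after copy/paste: {after}")
```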

Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents

2 code implementations • 5 Sep 2022 • Stephen Casper, Taylor Killian, Gabriel Kreiman, Dylan Hadfield-Menell

In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities.

Reinforcement Learning (RL)
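
One reading of "access to a target agent's internal state" is an adversarial policy that conditions on the victim's hidden activations in addition to its own observation. A minimal architectural sketch in PyTorch follows; the layer sizes and MLP structure are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical sketch: a white-box adversarial policy that sees both the
# environment observation and the victim policy's hidden activations.
# Sizes and the two-layer MLPs are illustrative, not the paper's architecture.
import torch
import torch.nn as nn

class VictimPolicy(nn.Module):
    def __init__(self, obs_dim=16, hidden_dim=64, act_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.Tanh())
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs):
        hidden = self.encoder(obs)   # internal state exposed to the adversary
        return self.head(hidden), hidden

class WhiteBoxAdversary(nn.Module):
    def __init__(self, obs_dim=16, victim_hidden_dim=64, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + victim_hidden_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )

    def forward(self, obs, victim_hidden):
        # Condition the adversary's action on the victim's internal activations.
        return self.net(torch.cat([obs, victim_hidden], dim=-1))

victim, adversary = VictimPolicy(), WhiteBoxAdversary()
obs = torch.randn(1, 16)
_, victim_hidden = victim(obs)
adv_action_logits = adversary(obs, victim_hidden.detach())
```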

Quantifying Local Specialization in Deep Neural Networks

3 code implementations • 13 Oct 2021 • Shlomi Hod, Daniel Filan, Stephen Casper, Andrew Critch, Stuart Russell

These results suggest that graph-based partitioning can reveal local specialization and that statistical methods can be used to automatically screen for sets of neurons that can be understood abstractly.

Robust Feature-Level Adversaries are Interpretability Tools

2 code implementations • 7 Oct 2021 • Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman

We demonstrate that these feature-level adversaries can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale.

Detecting Modularity in Deep Neural Networks

no code implementations • 29 Sep 2021 • Shlomi Hod, Stephen Casper, Daniel Filan, Cody Wild, Andrew Critch, Stuart Russell

These results suggest that graph-based partitioning can reveal modularity and help us understand how deep neural networks function.

Clusterability in Neural Networks

2 code implementations • 4 Mar 2021 • Daniel Filan, Stephen Casper, Shlomi Hod, Cody Wild, Andrew Critch, Stuart Russell

We also exhibit novel methods to promote clusterability in neural network training, and find that in multi-layer perceptrons they lead to more clusterable networks with little reduction in accuracy.

Importance and Coherence: Methods for Evaluating Modularity in Neural Networks

no code implementations • 1 Jan 2021 • Shlomi Hod, Stephen Casper, Daniel Filan, Cody Wild, Andrew Critch, Stuart Russell

We apply these methods on partitionings generated by a spectral clustering algorithm which uses a graph representation of the network's neurons and weights.

Clustering
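
A minimal sketch of the graph-based partitioning idea, assuming a small MLP and scikit-learn's SpectralClustering; the adjacency construction shown here (absolute weights between adjacent layers) is a simplified assumption rather than the paper's exact recipe.

```python
# Hypothetical sketch: build a graph over an MLP's neurons whose edge weights
# are absolute connection strengths between adjacent layers, then spectrally
# cluster it. The adjacency construction is a simplified assumption.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
W1 = rng.normal(size=(20, 30))   # layer 1 -> layer 2 weights (placeholder)
W2 = rng.normal(size=(30, 10))   # layer 2 -> layer 3 weights (placeholder)

sizes = [20, 30, 10]
n = sum(sizes)
adj = np.zeros((n, n))

def connect(block, row_offset, col_offset):
    """Fill one bipartite block of the neuron adjacency matrix."""
    adj[row_offset:row_offset + block.shape[0],
        col_offset:col_offset + block.shape[1]] = np.abs(block)

connect(W1, 0, 20)        # edges between layer-1 and layer-2 neurons
connect(W2, 20, 50)       # edges between layer-2 and layer-3 neurons
adj = adj + adj.T         # make the graph undirected

labels = SpectralClustering(
    n_clusters=4, affinity="precomputed", assign_labels="kmeans", random_state=0
).fit_predict(adj)
print(labels[:20])        # cluster assignment for layer-1 neurons
```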

Achilles Heels for AGI/ASI via Decision Theoretic Adversaries

no code implementations • 12 Oct 2020 • Stephen Casper

As AI continues to advance, it is important to know how advanced systems will make choices and in what ways they may fail.

Probing Neural Dialog Models for Conversational Understanding

1 code implementation • WS 2020 • Abdelrhman Saleh, Tovly Deutsch, Stephen Casper, Yonatan Belinkov, Stuart Shieber

The predominant approach to open-domain dialog generation relies on end-to-end training of neural models on chat datasets.

Open-Domain Dialog

Frivolous Units: Wider Networks Are Not Really That Wide

1 code implementation • 10 Dec 2019 • Stephen Casper, Xavier Boix, Vanessa D'Amario, Ling Guo, Martin Schrimpf, Kasper Vinken, Gabriel Kreiman

We identify two distinct types of "frivolous" units that proliferate when the network's width is increased: prunable units, which can be dropped out of the network without significant change to the output, and redundant units, whose activities can be expressed as a linear combination of others.
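
A minimal sketch of how the "redundant" criterion could be checked empirically, assuming access to a matrix of unit activations over a batch of inputs; the synthetic activations and the R² threshold are placeholders. Prunable units could be screened analogously by zeroing one unit at a time and measuring the change in the network's output.

```python
# Hypothetical sketch: flag "redundant" units whose activations are well
# approximated by a linear combination of the other units in the same layer.
# The random activations and the 0.99 R^2 threshold are placeholders.
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))                    # (num_inputs, num_units)
acts[:, 7] = 0.5 * acts[:, 3] - 2.0 * acts[:, 12]     # plant a redundant unit

def redundancy_r2(acts: np.ndarray, unit: int) -> float:
    """R^2 of predicting one unit's activations from all other units."""
    others = np.delete(acts, unit, axis=1)
    target = acts[:, unit]
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    residual = target - others @ coef
    return 1.0 - residual.var() / target.var()

redundant = [u for u in range(acts.shape[1]) if redundancy_r2(acts, u) > 0.99]
print("redundant units:", redundant)   # includes 3, 7, 12, which are mutually dependent
```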
