Search Results for author: Florian Tramer

Found 18 papers, 11 papers with code

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

2 code implementations • 28 Mar 2024 • Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong

To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at https://github.com/JailbreakBench/jailbreakbench that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the performance of attacks and defenses for various LLMs.
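
The abstract describes an evaluation pipeline: jailbreak artifacts are sent to a target model and scored by a judge against each behavior. The sketch below illustrates that loop only; query_target_model and judge_is_jailbroken are hypothetical placeholders, not the actual JailbreakBench API.

```python
# Hypothetical sketch of the evaluation loop described above: adversarial
# prompts ("jailbreak artifacts") are sent to a target model and a judge
# decides whether each response realizes the corresponding behavior.
# query_target_model() and judge_is_jailbroken() are placeholders, not the
# real JailbreakBench API.
from typing import Callable, Dict, List


def attack_success_rate(
    artifacts: List[Dict[str, str]],                   # [{"behavior": ..., "prompt": ...}, ...]
    query_target_model: Callable[[str], str],          # placeholder: prompt -> model response
    judge_is_jailbroken: Callable[[str, str], bool],   # placeholder: (behavior, response) -> bool
) -> float:
    """Fraction of behaviors for which the jailbreak prompt succeeds."""
    successes = 0
    for artifact in artifacts:
        response = query_target_model(artifact["prompt"])
        if judge_is_jailbroken(artifact["behavior"], response):
            successes += 1
    return successes / max(len(artifacts), 1)
```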

Are aligned neural networks adversarially aligned?

no code implementations • NeurIPS 2023 • Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, Ludwig Schmidt

We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force.
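
To make the "brute force" claim concrete, here is a minimal sketch of a white-box brute-force search over an adversarial suffix: randomly mutate tokens and keep mutations that raise the model's likelihood of a chosen target continuation. score_target is an assumed scoring callable; this is illustrative, not the paper's exact attack.

```python
# Minimal sketch of a brute-force adversarial-input search: randomly mutate
# a token suffix and keep mutations that increase the model's log-probability
# of a chosen target continuation. score_target() is a placeholder for the
# white-box log-likelihood of the target string.
import random
from typing import Callable, List


def brute_force_suffix(
    prompt_tokens: List[int],
    vocab_size: int,
    score_target: Callable[[List[int]], float],  # placeholder: tokens -> log p(target | tokens)
    suffix_len: int = 20,
    iters: int = 10_000,
    seed: int = 0,
) -> List[int]:
    rng = random.Random(seed)
    suffix = [rng.randrange(vocab_size) for _ in range(suffix_len)]
    best = score_target(prompt_tokens + suffix)
    for _ in range(iters):
        i = rng.randrange(suffix_len)
        candidate = suffix.copy()
        candidate[i] = rng.randrange(vocab_size)
        score = score_target(prompt_tokens + candidate)
        if score > best:          # keep the mutation only if it helps
            suffix, best = candidate, score
    return suffix
```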

Increasing Confidence in Adversarial Robustness Evaluations

no code implementations • 28 Jun 2022 • Roland S. Zimmermann, Wieland Brendel, Florian Tramer, Nicholas Carlini

Hundreds of defenses have been proposed to make deep neural networks robust against minimal (adversarial) input perturbations.

Adversarial Robustness

(Certified!!) Adversarial Robustness for Free!

1 code implementation • 21 Jun 2022 • Nicholas Carlini, Florian Tramer, Krishnamurthy Dj Dvijotham, Leslie Rice, MingJie Sun, J. Zico Kolter

In this paper we show how to achieve state-of-the-art certified adversarial robustness to 2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models.

Adversarial Robustness, Denoising
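
The idea of using only off-the-shelf pretrained models can be illustrated with a "denoise then classify" smoothing loop: add Gaussian noise, remove it with a pretrained denoiser, classify, and majority-vote. The denoiser and classifier below are assumed placeholders, and the actual paper additionally derives a certified radius from the vote statistics.

```python
# Sketch of denoised smoothing with off-the-shelf components: perturb the
# input with Gaussian noise, denoise it with a pretrained denoiser, classify
# with a pretrained classifier, and take a majority vote over many samples.
# denoiser() and classifier() are placeholders.
import numpy as np
from typing import Callable


def smoothed_predict(
    x: np.ndarray,
    denoiser: Callable[[np.ndarray], np.ndarray],   # placeholder pretrained denoiser
    classifier: Callable[[np.ndarray], int],        # placeholder pretrained classifier
    sigma: float = 0.25,
    n_samples: int = 100,
    num_classes: int = 10,
    seed: int = 0,
) -> int:
    rng = np.random.default_rng(seed)
    votes = np.zeros(num_classes, dtype=int)
    for _ in range(n_samples):
        noisy = x + sigma * rng.standard_normal(x.shape)
        votes[classifier(denoiser(noisy))] += 1
    return int(votes.argmax())
```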

Debugging Differential Privacy: A Case Study for Privacy Auditing

no code implementations • 24 Feb 2022 • Florian Tramer, Andreas Terzis, Thomas Steinke, Shuang Song, Matthew Jagielski, Nicholas Carlini

Differential Privacy can provide provable privacy guarantees for training data in machine learning.
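
Privacy auditing of the kind this paper studies turns an empirical attack into a lower bound on the privacy parameter: if a membership-inference attack achieves true-positive rate TPR at false-positive rate FPR against a mechanism claimed to satisfy (ε, δ)-DP, then ε must be at least ln((TPR − δ)/FPR). The sketch below shows this basic calculation only; real audits replace the point estimates with confidence intervals, so treat it as a simplification.

```python
# Simplified privacy-auditing calculation: convert a membership-inference
# attack's TPR/FPR into a lower bound on epsilon. A claimed epsilon below
# this bound is evidence of a bug in the DP implementation or analysis.
import math


def audited_epsilon_lower_bound(tpr: float, fpr: float, delta: float = 1e-5) -> float:
    if fpr <= 0 or tpr <= delta:
        return 0.0  # these rates give no evidence against the privacy claim
    return math.log((tpr - delta) / fpr)


# Example: an attack with 60% TPR at 0.1% FPR rules out any claim of eps < ~6.4.
print(audited_epsilon_lower_bound(tpr=0.60, fpr=0.001))
```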

Quantifying Memorization Across Neural Language Models

2 code implementations • 15 Feb 2022 • Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim.

Fairness, Memorization
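
The verbatim-emission test described in the abstract can be sketched as: prompt the model with a prefix of a training example and check whether greedy decoding reproduces the true continuation exactly. greedy_generate is an assumed placeholder for the model's decoder; prefix and suffix lengths are illustrative.

```python
# Sketch of an extractable-memorization check: feed the model a prefix of a
# training example and test whether greedy decoding emits the remaining
# tokens verbatim. greedy_generate() is a placeholder for the LM's greedy
# decoder: (prompt_tokens, n_new_tokens) -> generated tokens.
from typing import Callable, List


def is_memorized(
    example_tokens: List[int],
    greedy_generate: Callable[[List[int], int], List[int]],
    prefix_len: int = 50,
    suffix_len: int = 50,
) -> bool:
    prefix = example_tokens[:prefix_len]
    true_suffix = example_tokens[prefix_len:prefix_len + suffix_len]
    generated = greedy_generate(prefix, len(true_suffix))
    return generated == true_suffix  # verbatim emission counts as memorized
```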

Membership Inference Attacks From First Principles

3 code implementations • 7 Dec 2021 • Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, Florian Tramer

A membership inference attack allows an adversary to query a trained machine learning model to predict whether or not a particular example was contained in the model's training dataset.

Inference Attack, Membership Inference Attack
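
A simplified likelihood-ratio membership test in the spirit of this paper compares how plausible the target model's confidence on an example is under shadow models trained with the example versus without it. The sketch below fits Gaussians to logit-scaled confidences; it is a heavily reduced illustration, not the paper's full attack.

```python
# Simplified likelihood-ratio membership-inference sketch: fit Gaussians to
# the (logit-scaled) confidences a candidate example receives under shadow
# models trained with ("in") and without ("out") it, then score the observed
# confidence by the log-likelihood ratio.
import numpy as np
from scipy.stats import norm


def logit_scale(p: float, eps: float = 1e-8) -> float:
    p = min(max(p, eps), 1 - eps)
    return float(np.log(p / (1 - p)))


def lira_score(observed_conf: float, in_confs: list, out_confs: list) -> float:
    """Positive score suggests the example was in the training set."""
    obs = logit_scale(observed_conf)
    in_scaled = [logit_scale(c) for c in in_confs]
    out_scaled = [logit_scale(c) for c in out_confs]
    ll_in = norm.logpdf(obs, loc=np.mean(in_scaled), scale=np.std(in_scaled) + 1e-8)
    ll_out = norm.logpdf(obs, loc=np.mean(out_scaled), scale=np.std(out_scaled) + 1e-8)
    return float(ll_in - ll_out)
```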

Data Poisoning Won’t Save You From Facial Recognition

no code implementations • ICML Workshop AML 2021 • Evani Radiya-Dixit, Florian Tramer

We demonstrate that this strategy provides a false sense of security, as it ignores an inherent asymmetry between the parties: users' pictures are perturbed once and for all before being published and scraped, and must thereafter fool all future models---including models trained adaptively against the users' past attacks, or models that use technologies discovered after the attack.

Data Poisoning

Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them

no code implementations • ICML Workshop AML 2021 • Florian Tramer

We prove a general \emph{hardness reduction} between detection and classification of adversarial examples: given a robust detector for attacks at distance $\epsilon$ (in some metric), we can build a similarly robust (but inefficient) \emph{classifier} for attacks at distance $\epsilon/2$.
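
The reduction can be read as a (conceptually simple, computationally inefficient) construction: given a detector that robustly flags attacks at distance ε, classify an input by searching its ε/2-ball for a point the detector accepts and returning the base classifier's label there. The sketch below is conceptual only; enumerate_ball stands in for an exponential-time search and is not a practical algorithm.

```python
# Conceptual sketch of the detection-to-classification reduction (not a
# practical algorithm): search the eps/2-ball around the input for a point
# the robust detector accepts, and return the base classifier's label there.
# enumerate_ball() is a placeholder for an (inefficient) search procedure.
from typing import Callable, Iterable, Optional


def reduced_classifier(
    x,
    f: Callable,                                           # base classifier: point -> label
    detector_accepts: Callable,                            # True if the point is NOT flagged
    enumerate_ball: Callable[[object, float], Iterable],   # points within radius eps/2 of x
    eps: float,
) -> Optional[int]:
    for z in enumerate_ball(x, eps / 2):
        if detector_accepts(z):
            return f(z)        # robustly correct when x lies within eps/2 of clean data
    return None                # no accepted point found; abstain
```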

Extracting Training Data from Large Language Models

3 code implementations • 14 Dec 2020 • Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel

We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.

Language Modelling
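
At a high level, extraction attacks of this kind sample many generations from the model and rank them by a membership signal, for example the ratio of model perplexity to zlib-compressed length, before manual inspection of the top candidates. The sketch below illustrates that ranking step only; sample_text and perplexity are assumed placeholders.

```python
# Sketch of a training-data extraction ranking step: draw many generations
# from the LM and rank them by perplexity divided by zlib-compressed length
# (lower = more suspicious, i.e. fluent yet low-entropy text the model is
# unusually confident about). sample_text() and perplexity() are placeholders.
import zlib
from typing import Callable, List


def rank_extraction_candidates(
    sample_text: Callable[[], str],         # placeholder: draws one generation from the LM
    perplexity: Callable[[str], float],     # placeholder: LM perplexity of a string
    n_samples: int = 1000,
    top_k: int = 20,
) -> List[str]:
    candidates = [sample_text() for _ in range(n_samples)]

    def score(text: str) -> float:
        zlib_len = len(zlib.compress(text.encode("utf-8")))
        return perplexity(text) / max(zlib_len, 1)   # low score = likely memorized

    return sorted(candidates, key=score)[:top_k]
```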

Label-Only Membership Inference Attacks

1 code implementation • 28 Jul 2020 • Christopher A. Choquette-Choo, Florian Tramer, Nicholas Carlini, Nicolas Papernot

We empirically show that our label-only membership inference attacks perform on par with prior attacks that required access to model confidences.

L2 Regularization
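
The label-only setting can be illustrated with one common signal: using only hard-label queries, estimate how robust the model's prediction on a candidate point is to small perturbations; training members tend to sit further from the decision boundary. predict_label below is an assumed black-box interface, and this is a sketch of the general idea rather than the paper's full attack.

```python
# Sketch of a label-only membership signal: measure how often the model's
# hard-label prediction on a candidate point survives small random
# perturbations; higher robustness suggests the point was a training member.
# predict_label() is a placeholder for the labels-only query interface.
import numpy as np
from typing import Callable


def label_robustness_score(
    x: np.ndarray,
    y: int,
    predict_label: Callable[[np.ndarray], int],   # placeholder: black-box, labels only
    noise_scale: float = 0.05,
    n_queries: int = 100,
    seed: int = 0,
) -> float:
    """Fraction of noisy copies still classified as y; threshold to decide membership."""
    rng = np.random.default_rng(seed)
    hits = sum(
        predict_label(x + noise_scale * rng.standard_normal(x.shape)) == y
        for _ in range(n_queries)
    )
    return hits / n_queries
```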

On Adaptive Attacks to Adversarial Example Defenses

4 code implementations • NeurIPS 2020 • Florian Tramer, Nicholas Carlini, Wieland Brendel, Aleksander Madry

Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples.

SquirRL: Automating Attack Discovery on Blockchain Incentive Mechanisms with Deep Reinforcement Learning

1 code implementation • 4 Dec 2019 • Charlie Hou, Mingxun Zhou, Yan Ji, Phil Daian, Florian Tramer, Giulia Fanti, Ari Juels

Incentive mechanisms are central to the functionality of permissionless blockchains: they incentivize participants to run and secure the underlying consensus protocol.

Cryptography and Security

Physical Adversarial Examples for Object Detectors

no code implementations • 20 Jul 2018 • Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Florian Tramer, Atul Prakash, Tadayoshi Kohno, Dawn Song

In this work, we extend physical attacks to more challenging object detection models, a broader class of deep learning algorithms widely used to detect and label multiple objects within a scene.

Object, object-detection +1

Note on Attacking Object Detectors with Adversarial Stickers

no code implementations • 21 Dec 2017 • Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Dawn Song, Tadayoshi Kohno, Amir Rahmati, Atul Prakash, Florian Tramer

Given that state-of-the-art object detection algorithms are harder to fool with the same set of adversarial examples, here we show that these detectors can also be attacked by physical adversarial examples.

Deep Learning, Object
