Search Results for author: Nicholas Carlini

Found 79 papers, 41 papers with code

Universal and Transferable Adversarial Attacks on Aligned Language Models

11 code implementations • 27 Jul 2023 • Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer).

Adversarial Attack
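
A highly simplified sketch of the suffix-search idea behind this attack: greedily swap one suffix token at a time to maximize the summed "affirmative response" score across many queries. The real GCG method selects candidate swaps using gradients over the LLM's full vocabulary; here `affirmative_logprob` is a stand-in scorer and the vocabulary is a toy list, purely so the sketch runs.

```python
import random

# Toy vocabulary; the real attack searches over the target LLM's full token vocabulary.
VOCAB = ["!", "describing", "sure", "tutorial", "###", "please", "ignore", "step"]

def affirmative_logprob(prompt: str) -> float:
    """Stand-in scorer: log-probability that the model begins its reply affirmatively
    (e.g. "Sure, here is ..."). In the real attack this comes from the LLM's logits;
    a dummy value keeps the sketch runnable."""
    return -(abs(hash(prompt)) % 1000) / 100.0

def greedy_suffix_search(queries, suffix_len=8, iters=50, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    for _ in range(iters):
        pos = rng.randrange(suffix_len)  # pick one suffix position to update
        def score(tok):
            cand = suffix[:pos] + [tok] + suffix[pos + 1:]
            # "Universal" objective: sum the affirmative score over many queries.
            return sum(affirmative_logprob(q + " " + " ".join(cand)) for q in queries)
        suffix[pos] = max(VOCAB, key=score)
    return " ".join(suffix)

print(greedy_suffix_search(["<query 1>", "<query 2>"]))
```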

Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System

1 code implementation • 9 Sep 2023 • Daphne Ippolito, Nicholas Carlini, Katherine Lee, Milad Nasr, Yun William Yu

Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text.

Text Generation

Deduplicating Training Data Makes Language Models Better

1 code implementation • ACL 2022 • Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini

As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data.

Language Modelling • Sentence
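
The paper removes exact duplicates with a suffix array and near-duplicates with MinHash; the toy sketch below illustrates only the exact-substring idea, flagging documents that share a long token window, with whitespace splitting standing in for a real tokenizer.

```python
from collections import defaultdict

WINDOW = 50  # the paper targets training sequences that share a span of roughly 50 tokens

def find_repeated_windows(docs):
    """Map each repeated WINDOW-token span (hashed) to the set of documents containing it."""
    seen = defaultdict(set)
    for doc_id, text in enumerate(docs):
        tokens = text.split()  # stand-in for a real tokenizer
        for i in range(len(tokens) - WINDOW + 1):
            seen[hash(tuple(tokens[i:i + WINDOW]))].add(doc_id)
    return {h: ids for h, ids in seen.items() if len(ids) > 1}

corpus = ["first long document ...", "second document ..."]
duplicated = find_repeated_windows(corpus)  # empty here; real corpora yield many hits
```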

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples

4 code implementations • ICML 2018 • Anish Athalye, Nicholas Carlini, David Wagner

We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples.

Adversarial Attack • Adversarial Defense
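
A minimal PyTorch sketch of one of the paper's circumvention tools, BPDA (Backward Pass Differentiable Approximation): a non-differentiable preprocessor is applied as-is on the forward pass but treated as the identity on the backward pass, restoring useful gradients for the attack. The quantization defense shown here is an illustrative example, not a specific defense from the paper.

```python
import torch

class BPDAQuantize(torch.autograd.Function):
    """Example shattered-gradient defense: 8-bit quantization of the input."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x * 255.0) / 255.0  # non-differentiable step

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # approximate d(quantize)/dx as the identity

def pgd_step(model, x, y, eps=8 / 255, alpha=2 / 255):
    x_adv = x.clone().detach().requires_grad_(True)
    logits = model(BPDAQuantize.apply(x_adv))  # defense applied on the forward pass
    loss = torch.nn.functional.cross_entropy(logits, y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + alpha * x_adv.grad.sign()
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0, 1)
    return x_adv.detach()
```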

Towards Evaluating the Robustness of Neural Networks

26 code implementations • 16 Aug 2016 • Nicholas Carlini, David Wagner

Defensive distillation is a recently proposed approach that can take an arbitrary neural network, and increase its robustness, reducing the success rate of current attacks' ability to find adversarial examples from $95\%$ to $0.5\%$.

Adversarial Attack
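
A condensed PyTorch sketch of the targeted L2 objective from this paper (the "C&W attack"): minimize the perturbation norm plus a margin term that becomes non-positive once the target class wins, with a constant c trading off the two. The full attack also binary-searches over c and optimizes in tanh space, both omitted here.

```python
import torch

def cw_l2_attack(model, x, target, c=1.0, kappa=0.0, steps=200, lr=0.01):
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = torch.clamp(x + delta, 0, 1)
        logits = model(x_adv)
        target_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
        other_logit = logits.masked_fill(
            torch.nn.functional.one_hot(target, logits.size(1)).bool(), float("-inf")
        ).max(dim=1).values
        # f(x') = max(max_{i != t} Z_i - Z_t, -kappa): <= 0 once the target class is most likely
        f = torch.clamp(other_logit - target_logit, min=-kappa)
        loss = (delta.flatten(1).norm(dim=1) ** 2 + c * f).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.clamp(x + delta.detach(), 0, 1)
```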

Membership Inference Attacks From First Principles

2 code implementations • 7 Dec 2021 • Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, Florian Tramer

A membership inference attack allows an adversary to query a trained machine learning model to predict whether or not a particular example was contained in the model's training dataset.

Inference Attack • Membership Inference Attack
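
A minimal numpy/scipy sketch of the paper's likelihood-ratio test (LiRA): fit Gaussians to the logit-scaled confidences the target example receives under shadow models trained with and without it, then score the observed confidence by the log ratio of the two densities. Shadow-model training itself is assumed to have happened elsewhere.

```python
import numpy as np
from scipy.stats import norm

def logit_scale(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p) - np.log(1 - p)  # stabilizes the confidence distribution

def lira_score(conf_target, confs_in, confs_out):
    """Higher score -> more likely the example was in the training set."""
    z = logit_scale(conf_target)
    z_in, z_out = logit_scale(np.asarray(confs_in)), logit_scale(np.asarray(confs_out))
    mu_in, sd_in = z_in.mean(), z_in.std() + 1e-8
    mu_out, sd_out = z_out.mean(), z_out.std() + 1e-8
    return norm.logpdf(z, mu_in, sd_in) - norm.logpdf(z, mu_out, sd_out)
```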

Unrestricted Adversarial Examples

1 code implementation • 22 Sep 2018 • Tom B. Brown, Nicholas Carlini, Chiyuan Zhang, Catherine Olsson, Paul Christiano, Ian Goodfellow

We introduce a two-player contest for evaluating the safety and robustness of machine learning systems, with a large prize pool.

BIG-bench Machine Learning

Quantifying Memorization Across Neural Language Models

2 code implementations • 15 Feb 2022 • Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim.

Fairness • Memorization
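
A minimal sketch of the kind of measurement used here, assuming a Hugging Face causal LM (the "gpt2" model name is illustrative): feed the model the first k tokens of a training example and check whether greedy decoding reproduces the following tokens verbatim.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

def is_memorized(example: str, prefix_len=50, suffix_len=50) -> bool:
    """True if greedy decoding from the prefix reproduces the true suffix verbatim."""
    ids = tok(example, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_len]
    true_suffix = ids[prefix_len:prefix_len + suffix_len]
    out = model.generate(prefix.unsqueeze(0), max_new_tokens=suffix_len, do_sample=False)
    generated = out[0, prefix_len:prefix_len + suffix_len]
    return generated.shape == true_suffix.shape and bool((generated == true_suffix).all())
```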

Extracting Training Data from Large Language Models

3 code implementations • 14 Dec 2020 • Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel

We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.

Language Modelling

ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring

1 code implementation • ICLR 2020 • David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, Colin Raffel

We improve the recently proposed "MixMatch" semi-supervised learning algorithm by introducing two new techniques: distribution alignment and augmentation anchoring.
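
A minimal numpy sketch of one of the two techniques, distribution alignment: rescale the model's prediction on an unlabeled example by the ratio of the labeled-class marginal to a running average of the model's own predictions, renormalize, then sharpen with a temperature as in MixMatch. Augmentation anchoring is omitted.

```python
import numpy as np

def distribution_alignment(q, p_labeled, p_model_avg, temperature=0.5):
    """q: model prediction on an unlabeled example (probability vector)."""
    q_aligned = q * (p_labeled / (p_model_avg + 1e-8))
    q_aligned = q_aligned / q_aligned.sum()
    q_sharp = q_aligned ** (1.0 / temperature)  # temperature sharpening
    return q_sharp / q_sharp.sum()
```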

On Adaptive Attacks to Adversarial Example Defenses

4 code implementations • NeurIPS 2020 • Florian Tramer, Nicholas Carlini, Wieland Brendel, Aleksander Madry

Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples.

Label-Only Membership Inference Attacks

1 code implementation • 28 Jul 2020 • Christopher A. Choquette-Choo, Florian Tramer, Nicholas Carlini, Nicolas Papernot

We empirically show that our label-only membership inference attacks perform on par with prior attacks that required access to model confidences.

L2 Regularization
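
A minimal numpy sketch of the intuition behind the label-only setting: with only hard labels available, use an example's robustness to random perturbations as a proxy for the model's confidence on it, and threshold that score for the membership decision. The paper's strongest variant estimates the actual distance to the decision boundary with adversarial attacks; this noise-robustness proxy is a simplification.

```python
import numpy as np

def label_only_score(predict_label, x, y, sigma=0.05, n_samples=100, rng=None):
    """predict_label(x) returns a hard label only. Higher score suggests membership."""
    rng = rng or np.random.default_rng(0)
    noisy = x + sigma * rng.standard_normal((n_samples,) + x.shape)
    labels = np.array([predict_label(xi) for xi in noisy])
    return float((labels == y).mean())  # training points tend to sit farther from the boundary
```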

AdaMatch: A Unified Approach to Semi-Supervised Learning and Domain Adaptation

5 code implementations • ICLR 2022 • David Berthelot, Rebecca Roelofs, Kihyuk Sohn, Nicholas Carlini, Alex Kurakin

We extend semi-supervised learning to the problem of domain adaptation to learn significantly higher-accuracy models that train on one data distribution and test on a different one.

Semi-supervised Domain Adaptation • Unsupervised Domain Adaptation

Provably Minimally-Distorted Adversarial Examples

1 code implementation • 29 Sep 2017 • Nicholas Carlini, Guy Katz, Clark Barrett, David L. Dill

Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.

Cryptanalytic Extraction of Neural Network Models

1 code implementation • 10 Mar 2020 • Nicholas Carlini, Matthew Jagielski, Ilya Mironov

We argue that the machine learning problem of model extraction is actually a cryptanalytic problem in disguise, and should be studied as such.

Model extraction

(Certified!!) Adversarial Robustness for Free!

1 code implementation • 21 Jun 2022 • Nicholas Carlini, Florian Tramer, Krishnamurthy Dj Dvijotham, Leslie Rice, MingJie Sun, J. Zico Kolter

In this paper we show how to achieve state-of-the-art certified adversarial robustness to 2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models.

Adversarial Robustness • Denoising
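
A minimal sketch of the denoised-smoothing prediction step this pipeline builds on: add Gaussian noise to many copies of the input, denoise each with an off-the-shelf diffusion denoiser, classify with an off-the-shelf classifier, and take a majority vote; the certified l2 radius then follows the sigma * Phi^{-1}(p) bound from randomized smoothing. Here `classify` and `denoise` are stand-ins for the pretrained components, and a real certificate would use a binomial lower bound on p with abstention.

```python
import numpy as np
from scipy.stats import norm

def smoothed_predict(classify, denoise, x, sigma=0.25, n=1000, num_classes=1000, rng=None):
    """classify(x) -> class id; denoise(x) -> denoised image (stand-ins for pretrained models)."""
    rng = rng or np.random.default_rng(0)
    votes = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        noisy = x + sigma * rng.standard_normal(x.shape)
        votes[classify(denoise(noisy))] += 1
    top = int(votes.argmax())
    p_hat = votes[top] / n  # a real certificate uses a Clopper-Pearson lower bound instead
    radius = sigma * norm.ppf(p_hat) if p_hat > 0.5 else 0.0
    return top, radius
```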

Poisoning and Backdooring Contrastive Learning

1 code implementation • ICLR 2022 • Nicholas Carlini, Andreas Terzis

Multimodal contrastive learning methods like CLIP train on noisy and uncurated training datasets.

Contrastive Learning

Considerations for Differentially Private Learning with Large-Scale Public Pretraining

2 code implementations • 13 Dec 2022 • Florian Tramèr, Gautam Kamath, Nicholas Carlini

The performance of differentially private machine learning can be boosted significantly by leveraging the transfer learning capabilities of non-private models pretrained on large public datasets.

Privacy Preserving • Transfer Learning
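
For context, a minimal numpy sketch of the DP-SGD update that such private fine-tuning typically relies on, and whose accuracy cost public pretraining is meant to offset: clip each per-example gradient to norm C and add Gaussian noise calibrated to C before averaging. This is a generic illustration, not the paper's own training code.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:  # clip each example's gradient to norm C
        clipped.append(g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12)))
    noise = noise_mult * clip_norm * rng.standard_normal(params.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
    return params - lr * noisy_mean
```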

Evading Adversarial Example Detection Defenses with Orthogonal Projected Gradient Descent

1 code implementation • ICLR 2022 • Oliver Bryniarski, Nabeel Hingun, Pedro Pachuca, Vincent Wang, Nicholas Carlini

Evading adversarial example detection defenses requires finding adversarial examples that must simultaneously (a) be misclassified by the model and (b) be detected as non-adversarial.
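
A minimal numpy sketch of the Orthogonal PGD idea: at each attack step, remove from the classifier-attack gradient its component along the detector's gradient (the paper also alternates between the two objectives), so progress on one goal does not undo the other.

```python
import numpy as np

def orthogonal_step(grad_classifier, grad_detector, step_size=0.01):
    """Project the classifier-attack gradient orthogonally to the detector gradient."""
    g_c = grad_classifier.ravel()
    g_d = grad_detector.ravel()
    g_orth = g_c - (g_c @ g_d) / (g_d @ g_d + 1e-12) * g_d
    return step_size * np.sign(g_orth).reshape(grad_classifier.shape)  # l_inf-style signed step
```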

Part-Based Models Improve Adversarial Robustness

1 code implementation • 15 Sep 2022 • Chawin Sitawarin, Kornrapat Pongmala, Yizheng Chen, Nicholas Carlini, David Wagner

We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks by introducing a part-based model for object classification.

Adversarial Robustness

Preprocessors Matter! Realistic Decision-Based Attacks on Machine Learning Systems

1 code implementation • 7 Oct 2022 • Chawin Sitawarin, Florian Tramèr, Nicholas Carlini

Decision-based attacks construct adversarial examples against a machine learning (ML) model by making only hard-label queries.

Evading Black-box Classifiers Without Breaking Eggs

1 code implementation • 5 Jun 2023 • Edoardo Debenedetti, Nicholas Carlini, Florian Tramèr

We then design new attacks that reduce the number of bad queries by $1.5$-$7.3\times$, but often at a significant increase in total (non-bad) queries.

Stateful Detection of Black-Box Adversarial Attacks

1 code implementation • 12 Jul 2019 • Steven Chen, Nicholas Carlini, David Wagner

This is true even when, as is the case in many practical settings, the classifier is hosted as a remote service and so the adversary does not have direct access to the model parameters.
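
A minimal numpy sketch of the stateful defense the paper proposes: keep a history of each account's (embedded) queries and flag the account when a new query is suspiciously close to its k nearest previous queries, a pattern characteristic of black-box attack sequences. The embedding function and threshold are assumptions left to the deployer.

```python
import numpy as np

class StatefulDetector:
    def __init__(self, k=10, threshold=0.1):
        self.history, self.k, self.threshold = [], k, threshold

    def check(self, query_embedding):
        """Return True if the query looks like part of a black-box attack sequence."""
        flagged = False
        if len(self.history) >= self.k:
            dists = np.linalg.norm(np.array(self.history) - query_embedding, axis=1)
            flagged = float(np.sort(dists)[: self.k].mean()) < self.threshold
        self.history.append(query_embedding)
        return flagged
```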

Defensive Distillation is Not Robust to Adversarial Examples

1 code implementation • 14 Jul 2016 • Nicholas Carlini, David Wagner

We show that defensive distillation is not secure: it is no more resistant to targeted misclassification attacks than unprotected neural networks.

Data Poisoning Won't Save You From Facial Recognition

1 code implementation • 28 Jun 2021 • Evani Radiya-Dixit, Sanghyun Hong, Nicholas Carlini, Florian Tramèr

We demonstrate that this strategy provides a false sense of security, as it ignores an inherent asymmetry between the parties: users' pictures are perturbed once and for all before being published (at which point they are scraped) and must thereafter fool all future models -- including models trained adaptively against the users' past attacks, or models that use technologies discovered after the attack.

Data Poisoning

MagNet and "Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples

1 code implementation • 22 Nov 2017 • Nicholas Carlini, David Wagner

MagNet and "Efficient Defenses..." were recently proposed as defenses to adversarial examples.

On the Robustness of the CVPR 2018 White-Box Adversarial Example Defenses

2 code implementations • 10 Apr 2018 • Anish Athalye, Nicholas Carlini

Neural networks are known to be vulnerable to adversarial examples.

Initialization Matters for Adversarial Transfer Learning

1 code implementation • 10 Dec 2023 • Andong Hua, Jindong Gu, Zhiyu Xue, Nicholas Carlini, Eric Wong, Yao Qin

Based on this, we propose Robust Linear Initialization (RoLI) for adversarial finetuning, which initializes the linear head with the weights obtained by adversarial linear probing to maximally inherit the robustness from pretraining.

Adversarial Robustness • Image Classification +1

The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks

no code implementations • 22 Feb 2018 • Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, Dawn Song

This paper describes a testing methodology for quantitatively assessing the risk that rare or unique training-data sequences are unintentionally memorized by generative sequence models, a common type of machine-learning model.
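
A minimal numpy sketch of the paper's exposure metric: insert a random canary during training, then compare the trained model's perplexity on that canary against its perplexity on a large set of alternative candidate canaries; exposure is log2 of the candidate-space size minus log2 of the canary's rank.

```python
import numpy as np

def exposure(canary_log_ppl, candidate_log_ppls):
    """Higher exposure -> the canary is more strongly memorized."""
    all_ppls = np.append(np.asarray(candidate_log_ppls), canary_log_ppl)
    rank = 1 + int((all_ppls < canary_log_ppl).sum())  # rank 1 = canary has the lowest perplexity
    return float(np.log2(len(all_ppls)) - np.log2(rank))
```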

Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods

no code implementations • 20 May 2017 • Nicholas Carlini, David Wagner

Neural networks are known to be vulnerable to adversarial examples: inputs that are close to natural inputs but classified incorrectly.

Adversarial Example Defenses: Ensembles of Weak Defenses are not Strong

no code implementations • 15 Jun 2017 • Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song

We ask whether a strong defense can be created by combining multiple (possibly weak) defenses.

Prototypical Examples in Deep Learning: Metrics, Characteristics, and Utility

no code implementations • ICLR 2019 • Nicholas Carlini, Ulfar Erlingsson, Nicolas Papernot

Machine learning (ML) research has investigated prototypes: examples that are representative of the behavior to be learned.

Adversarial Robustness

Ground-Truth Adversarial Examples

no code implementations • ICLR 2018 • Nicholas Carlini, Guy Katz, Clark Barrett, David L. Dill

We demonstrate how ground truths can serve to assess the effectiveness of attack techniques, by comparing the adversarial examples produced by those attacks to the ground truths; and also of defense techniques, by computing the distance to the ground truths before and after the defense is applied, and measuring the improvement.

A critique of the DeepSec Platform for Security Analysis of Deep Learning Models

no code implementations • 17 May 2019 • Nicholas Carlini

At IEEE S&P 2019, the paper "DeepSec: A Uniform Platform for Security Analysis of Deep Learning Model" aims to "systematically evaluate the existing adversarial attack and defense methods."

Adversarial Attack

High Accuracy and High Fidelity Extraction of Neural Networks

no code implementations • 3 Sep 2019 • Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, Nicolas Papernot

In a model extraction attack, an adversary steals a copy of a remotely deployed machine learning model, given oracle prediction access.

Model extraction • Vocal Bursts Intensity Prediction

Distribution Density, Tails, and Outliers in Machine Learning: Metrics and Applications

no code implementations • 29 Oct 2019 • Nicholas Carlini, Úlfar Erlingsson, Nicolas Papernot

We develop techniques to quantify the degree to which a given (training or testing) example is an outlier in the underlying distribution.

Adversarial Robustness • BIG-bench Machine Learning

Evading Deepfake-Image Detectors with White- and Black-Box Attacks

no code implementations • 1 Apr 2020 • Nicholas Carlini, Hany Farid

We show that such forensic classifiers are vulnerable to a range of attacks that reduce the classifier to near-0% accuracy.

Face Swapping

A Partial Break of the Honeypots Defense to Catch Adversarial Attacks

no code implementations • 23 Sep 2020 • Nicholas Carlini

A recent defense proposes to inject "honeypots" into neural networks in order to detect adversarial attacks.

Erratum Concerning the Obfuscated Gradients Attack on Stochastic Activation Pruning

no code implementations • 30 Sep 2020 • Guneet S. Dhillon, Nicholas Carlini

Stochastic Activation Pruning (SAP) (Dhillon et al., 2018) is a defense to adversarial examples that was attacked and found to be broken by the "Obfuscated Gradients" paper (Athalye et al., 2018).

Adversary Instantiation: Lower Bounds for Differentially Private Machine Learning

no code implementations • 11 Jan 2021 • Milad Nasr, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, Nicholas Carlini

DP formalizes this data leakage through a cryptographic game, where an adversary must predict if a model was trained on a dataset D, or a dataset D' that differs in just one example. If observing the training algorithm does not meaningfully increase the adversary's odds of successfully guessing which dataset the model was trained on, then the algorithm is said to be differentially private.

BIG-bench Machine Learning
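
A small sketch of the quantity this style of analysis instantiates: under pure epsilon-DP, the adversary's success probability in the D-vs-D' guessing game with a balanced prior is at most e^eps / (1 + e^eps), so an observed attack accuracy p > 0.5 implies a lower bound of log(p / (1 - p)) on epsilon. This ignores delta and confidence intervals, both of which the paper handles carefully.

```python
import math

def max_guessing_accuracy(epsilon: float) -> float:
    """Upper bound on the adversary's success probability under pure eps-DP (balanced prior)."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def epsilon_lower_bound(attack_accuracy: float) -> float:
    """Empirical lower bound on epsilon implied by an attack with accuracy > 0.5
    (delta and sampling error are ignored in this sketch)."""
    return math.log(attack_accuracy / (1.0 - attack_accuracy))
```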

Poisoning the Unlabeled Dataset of Semi-Supervised Learning

no code implementations • 4 May 2021 • Nicholas Carlini

Our attacks are highly effective across datasets and semi-supervised learning methods.

Handcrafted Backdoors in Deep Neural Networks

no code implementations • 8 Jun 2021 • Sanghyun Hong, Nicholas Carlini, Alexey Kurakin

When machine learning training is outsourced to third parties, backdoor attacks become practical as the third party who trains the model may act maliciously to inject hidden behaviors into the otherwise accurate model.

Backdoor Attack

Unsolved Problems in ML Safety

no code implementations • 28 Sep 2021 • Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt

Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings.

Debugging Differential Privacy: A Case Study for Privacy Auditing

no code implementations • 24 Feb 2022 • Florian Tramer, Andreas Terzis, Thomas Steinke, Shuang Song, Matthew Jagielski, Nicholas Carlini

Differential Privacy can provide provable privacy guarantees for training data in machine learning.

Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets

no code implementations • 31 Mar 2022 • Florian Tramèr, Reza Shokri, Ayrton San Joaquin, Hoang Le, Matthew Jagielski, Sanghyun Hong, Nicholas Carlini

We show that an adversary who can poison a training dataset can cause models trained on this dataset to leak significant private details of training points belonging to other parties.

Attribute • BIG-bench Machine Learning

Increasing Confidence in Adversarial Robustness Evaluations

no code implementations • 28 Jun 2022 • Roland S. Zimmermann, Wieland Brendel, Florian Tramer, Nicholas Carlini

Hundreds of defenses have been proposed to make deep neural networks robust against minimal (adversarial) input perturbations.

Adversarial Robustness

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

no code implementations • 31 Oct 2022 • Daphne Ippolito, Florian Tramèr, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher A. Choquette-Choo, Nicholas Carlini

Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures.

Memorization • Open-Ended Question Answering +1

Publishing Efficient On-device Models Increases Adversarial Vulnerability

no code implementations • 28 Dec 2022 • Sanghyun Hong, Nicholas Carlini, Alexey Kurakin

We then show that the vulnerability increases as the similarity between a full-scale model and its efficient counterpart increases.

Quantization

Extracting Training Data from Diffusion Models

no code implementations • 30 Jan 2023 • Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace

Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images.

Privacy Preserving

Tight Auditing of Differentially Private Machine Learning

no code implementations • 15 Feb 2023 • Milad Nasr, Jamie Hayes, Thomas Steinke, Borja Balle, Florian Tramèr, Matthew Jagielski, Nicholas Carlini, Andreas Terzis

Moreover, our auditing scheme requires only two training runs (instead of thousands) to produce tight privacy estimates, by adapting recent advances in tight composition theorems for differential privacy.

Federated Learning

Effective Prompt Extraction from Language Models

no code implementations • 13 Jul 2023 • Yiming Zhang, Nicholas Carlini, Daphne Ippolito

In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability.

Hallucination

A LLM Assisted Exploitation of AI-Guardian

no code implementations • 20 Jul 2023 • Nicholas Carlini

Large language models (LLMs) are now highly capable at a diverse range of tasks.

Computer Security • Language Modelling

Privacy Side Channels in Machine Learning Systems

no code implementations • 11 Sep 2023 • Edoardo Debenedetti, Giorgio Severi, Nicholas Carlini, Christopher A. Choquette-Choo, Matthew Jagielski, Milad Nasr, Eric Wallace, Florian Tramèr

Most current approaches for protecting privacy in machine learning (ML) assume that models exist in a vacuum, when in reality, ML models are part of larger systems that include components for training data filtering, output monitoring, and more.

Scalable Extraction of Training Data from (Production) Language Models

no code implementations • 28 Nov 2023 • Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee

This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.

Chatbot • Memorization

Query-Based Adversarial Prompt Generation

no code implementations • 19 Feb 2024 • Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr

Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior.

Language Modelling

Forcing Diffuse Distributions out of Language Models

1 code implementation • 16 Apr 2024 • Yiming Zhang, Avi Schwarzschild, Nicholas Carlini, Zico Kolter, Daphne Ippolito

Despite being trained specifically to follow user instructions, today's language models perform poorly when instructed to produce random outputs.

Language Modelling • valid
