no code implementations • 24 Dec 2024 • Daniel Paleka, Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Vineeth Bhat, Adam Shen, Evan Wang, Florian Tramèr
Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future.
1 code implementation • 22 Nov 2024 • Jie Zhang, Kristina Nikolić, Nicholas Carlini, Florian Tramèr
Ensemble everything everywhere is a recently proposed defense against adversarial examples, intended to make image classifiers robust.
no code implementations • 15 Nov 2024 • Michael Aerni, Javier Rando, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, Florian Tramèr
For a variety of innocuous prompt categories (e.g., writing a letter or a tutorial), we show that up to 15% of the text output by popular conversational language models overlaps with snippets from the Internet.
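A minimal sketch of how such overlap could be measured, assuming a hypothetical reference corpus of web snippets; this is illustrative only and not the paper's own matching pipeline:

```python
# Hedged sketch: measure verbatim n-gram overlap between a model's output and a
# reference text corpus. `web_snippets` and the 8-gram window are illustrative
# assumptions, not the paper's exact methodology.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(model_output: str, web_snippets: list[str], n: int = 8) -> float:
    """Fraction of the output's n-grams that appear verbatim in the corpus."""
    reference = set()
    for snippet in web_snippets:
        reference |= ngrams(snippet, n)
    output_grams = ngrams(model_output, n)
    if not output_grams:
        return 0.0
    return len(output_grams & reference) / len(output_grams)

# Toy usage:
corpus = ["to whom it may concern i am writing to express my sincere interest in the position"]
output = "To whom it may concern I am writing to express my sincere interest in the role."
print(f"overlap: {overlap_fraction(output, corpus):.2f}")
```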
no code implementations • 17 Oct 2024 • Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, Daphne Ippolito
Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO).
1 code implementation • 4 Oct 2024 • Javier Rando, Hannah Korevaar, Erik Brinkman, Ivan Evtimov, Florian Tramèr
Augmenting language models with image inputs may enable more effective jailbreak attacks through continuous optimization, unlike text inputs that require discrete optimization.
no code implementations • 29 Sep 2024 • Jie Zhang, Debeshee Das, Gautam Kamath, Florian Tramèr
We argue that this approach is fundamentally unsound: to provide convincing evidence, the data creator needs to demonstrate that their attack has a low false positive rate, i.e., that the attack's output is unlikely under the null hypothesis that the model was not trained on the target data.
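A minimal sketch of calibrating that false positive rate, assuming hypothetical `attack_score` and `null_models` objects (models verifiably not trained on the target data); this illustrates the statistical argument, not the paper's experiments:

```python
# Hedged sketch: estimate an attack's false positive rate by running it against
# "null" models that were not trained on the target data. `attack_score` and
# `null_models` are illustrative placeholders.
import numpy as np

def false_positive_rate(attack_score, target_data, null_models, threshold):
    """Fraction of null models the attack still flags as having trained on target_data."""
    scores = np.array([attack_score(m, target_data) for m in null_models])
    return float(np.mean(scores >= threshold))

# If this FPR is high, a positive result on the suspect model is weak evidence:
# the attack fires often even under the null hypothesis of no training.
```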
1 code implementation • 26 Sep 2024 • Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando
Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed.
no code implementations • 11 Jul 2024 • Francesco Pinto, Nathalie Rauschmayr, Florian Tramèr, Philip Torr, Federico Tombari
Vision-Language Models (VLMs) have made remarkable progress in document-based Visual Question Answering (i.e., responding to queries about the contents of an input document provided as an image).
no code implementations • 26 Jun 2024 • Fredrik Nestaas, Edoardo Debenedetti, Florian Tramèr
As LLMs are increasingly used to rank third-party content, we expect Preference Manipulation Attacks to emerge as a significant threat.
1 code implementation • 23 Jun 2024 • Debeshee Das, Jie Zhang, Florian Tramèr
Membership inference (MI) attacks try to determine if a data sample was used to train a machine learning model.
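For reference, the classic loss-threshold baseline, written as a hedged sketch (it is not the attack studied in this paper, and `model_loss` is an assumed callable):

```python
# Hedged sketch of the loss-threshold membership inference baseline:
# samples with unusually low loss under the model are guessed to be members.
import numpy as np

def mi_loss_threshold(model_loss, samples, threshold):
    """Return a boolean membership guess per sample: True = 'was in the training set'."""
    losses = np.array([model_loss(x) for x in samples])
    return losses < threshold

# The threshold is typically calibrated on data known to be non-members,
# so that the attack's false positive rate can be controlled.
```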
1 code implementation • 19 Jun 2024 • Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, Florian Tramèr
Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks.
1 code implementation • 12 Jun 2024 • Edoardo Debenedetti, Javier Rando, Daniel Paleka, Silaghi Fineas Florin, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, Giovanni Cherubin, Santiago Zanella-Beguelin, Robin Schmid, Victor Klemm, Takahiro Miki, Chenhao Li, Stefan Kraft, Mario Fritz, Florian Tramèr, Sahar Abdelnabi, Lea Schönherr
To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024, where the flag is a secret string in the LLM system prompt.
2 code implementations • 26 Apr 2024 • Michael Aerni, Jie Zhang, Florian Tramèr
Empirical defenses for machine learning privacy forgo the provable guarantees of differential privacy in the hope of achieving higher utility while resisting realistic adversaries.
2 code implementations • 22 Apr 2024 • Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tramèr
Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities.
1 code implementation • 30 Mar 2024 • Shanglun Feng, Florian Tramèr
We show that this practice introduces a new risk of privacy backdoors.
no code implementations • 19 Feb 2024 • Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior.
no code implementations • 28 Nov 2023 • Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
2 code implementations • 24 Nov 2023 • Javier Rando, Florian Tramèr
Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses.
no code implementations • 11 Sep 2023 • Edoardo Debenedetti, Giorgio Severi, Nicholas Carlini, Christopher A. Choquette-Choo, Matthew Jagielski, Milad Nasr, Eric Wallace, Florian Tramèr
Most current approaches for protecting privacy in machine learning (ML) assume that models exist in a vacuum.
2 code implementations • 16 Jun 2023 • Lukas Fluri, Daniel Paleka, Florian Tramèr
If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth?
1 code implementation • 5 Jun 2023 • Edoardo Debenedetti, Nicholas Carlini, Florian Tramèr
We then design new attacks that reduce the number of bad queries by $1.5$-$7.3\times$, but often at a significant increase in total (non-bad) queries.
no code implementations • 27 Feb 2023 • Keane Lucas, Matthew Jagielski, Florian Tramèr, Lujo Bauer, Nicholas Carlini
It is becoming increasingly imperative to design robust ML defenses.
no code implementations • 20 Feb 2023 • Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr
Deep learning models are often trained on distributed, web-scale datasets crawled from the internet.
no code implementations • 15 Feb 2023 • Milad Nasr, Jamie Hayes, Thomas Steinke, Borja Balle, Florian Tramèr, Matthew Jagielski, Nicholas Carlini, Andreas Terzis
Moreover, our auditing scheme requires only two training runs (instead of thousands) to produce tight privacy estimates, by adapting recent advances in tight composition theorems for differential privacy.
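The general idea behind such audits can be sketched with the standard DP inequality $\mathrm{TPR} \le e^{\epsilon}\,\mathrm{FPR} + \delta$; the snippet below is a hedged illustration of converting an attack's success into a lower bound on $\epsilon$, not the paper's two-run estimator:

```python
# Hedged sketch: turn an empirical membership-inference trade-off into a lower
# bound on epsilon, using the standard DP inequality TPR <= e^eps * FPR + delta.
import math

def epsilon_lower_bound(tpr: float, fpr: float, delta: float = 1e-5) -> float:
    """Lower bound on epsilon implied by an observed (TPR, FPR) pair."""
    if fpr <= 0 or tpr <= delta:
        return 0.0
    return max(0.0, math.log((tpr - delta) / fpr))

print(epsilon_lower_bound(tpr=0.6, fpr=0.05))  # ~ ln(0.6/0.05) ≈ 2.48
```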
1 code implementation • 30 Jan 2023 • Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace
Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images.
2 code implementations • 13 Dec 2022 • Florian Tramèr, Gautam Kamath, Nicholas Carlini
The performance of differentially private machine learning can be boosted significantly by leveraging the transfer learning capabilities of non-private models pretrained on large public datasets.
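A hedged sketch of the recipe: extract features with a public, non-private pretrained model, then train only a small head with DP-SGD (per-example clipping plus Gaussian noise). The hyperparameters and the feature extractor are illustrative assumptions, not the paper's configuration:

```python
# Hedged sketch: DP-SGD on a linear classifier over features from a public
# pretrained model. Per-example gradients are clipped and noised.
import numpy as np

def dp_sgd_linear(features, labels, steps=500, lr=0.1, clip=1.0, noise_mult=1.0, batch=64):
    n, d = features.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        X, y = features[idx], labels[idx]                 # y in {-1, +1}
        margins = y * (X @ w)
        # Per-example logistic-loss gradients, shape (batch, d)
        grads = (-y * (1 - 1 / (1 + np.exp(-margins))))[:, None] * X
        # Clip each example's gradient to L2 norm <= clip
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        # Add Gaussian noise calibrated to the clipping norm, then average
        noisy_sum = grads.sum(axis=0) + rng.normal(0, noise_mult * clip, size=d)
        w -= lr * noisy_sum / batch
    return w
```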
no code implementations • 31 Oct 2022 • Daphne Ippolito, Florian Tramèr, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher A. Choquette-Choo, Nicholas Carlini
Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures.
1 code implementation • 7 Oct 2022 • Chawin Sitawarin, Florian Tramèr, Nicholas Carlini
Decision-based attacks construct adversarial examples against a machine learning (ML) model by making only hard-label queries.
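The core primitive of such attacks can be illustrated with a hedged sketch: binary-search along the segment between the original input and an already-misclassified point, using only hard-label queries (`query_label` is an assumed black box, not this paper's specific attack):

```python
# Hedged sketch of a hard-label boundary search between a clean input and a
# misclassified starting point, using only the model's predicted class.
import numpy as np

def boundary_search(query_label, x_orig, x_adv, true_label, steps=25):
    """Return the closest point to x_orig on the segment [x_orig, x_adv]
    that the model still misclassifies."""
    lo, hi = 0.0, 1.0                      # interpolation weight toward x_adv
    for _ in range(steps):
        mid = (lo + hi) / 2
        x_mid = (1 - mid) * x_orig + mid * x_adv
        if query_label(x_mid) != true_label:
            hi = mid                       # still adversarial: move closer
        else:
            lo = mid                       # back to the correct class: move away
    return (1 - hi) * x_orig + hi * x_adv
```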
no code implementations • 3 Oct 2022 • Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr
We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content.
1 code implementation • 25 Aug 2022 • Harsh Chaudhari, John Abascal, Alina Oprea, Matthew Jagielski, Florian Tramèr, Jonathan Ullman
Property inference attacks allow an adversary to extract global properties of the training dataset from a machine learning model.
no code implementations • 30 Jun 2022 • Matthew Jagielski, Om Thakkar, Florian Tramèr, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, Chiyuan Zhang
In memorization, models overfit specific training examples and become susceptible to privacy attacks.
no code implementations • 31 Mar 2022 • Florian Tramèr, Reza Shokri, Ayrton San Joaquin, Hoang Le, Matthew Jagielski, Sanghyun Hong, Nicholas Carlini
We show that an adversary who can poison a training dataset can cause models trained on this dataset to leak significant private details of training points belonging to other parties.
no code implementations • 11 Feb 2022 • Hannah Brown, Katherine Lee, FatemehSadat Mireshghallah, Reza Shokri, Florian Tramèr
Language models lack the ability to understand the context and sensitivity of text, and tend to memorize phrases present in their training sets.
4 code implementations • ICLR 2022 • Xuechen Li, Florian Tramèr, Percy Liang, Tatsunori Hashimoto
Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead.
2 code implementations • 16 Aug 2021 • Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, aditi raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, Percy Liang
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks.
no code implementations • 24 Jul 2021 • Florian Tramèr
We prove a general hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance $\epsilon$ (in some metric), we can build a similarly robust (but inefficient) classifier for attacks at distance $\epsilon/2$.
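One way to read the reduction, written as an informal sketch (it assumes the detector accepts clean inputs and is not the paper's full construction or proof):

```latex
% Hedged sketch. Assume a classifier $f$ with detector $d$ that is robust at
% distance $\epsilon$: on any $z$ within $\epsilon$ of a clean input $x_0$,
% either $d$ rejects $z$ or $f(z)$ equals the correct label $y(x_0)$.
\[
  g(x) \;=\;
  \begin{cases}
    f(z) & \text{for some } z \in B_{\epsilon/2}(x) \text{ with } d(z) = \texttt{accept},\\
    \bot & \text{if no such } z \text{ exists.}
  \end{cases}
\]
% If $x = x_0 + \delta$ with $\|\delta\| \le \epsilon/2$, then $x_0$ itself lies
% in $B_{\epsilon/2}(x)$ and is accepted, so $g$ is defined; and every accepted
% $z \in B_{\epsilon/2}(x)$ lies within $\epsilon$ of $x_0$, so $f(z) = y(x_0)$.
% Hence $g$ is robust at distance $\epsilon/2$, though the search over
% $B_{\epsilon/2}(x)$ makes it inefficient.
```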
1 code implementation • 28 Jun 2021 • Evani Radiya-Dixit, Sanghyun Hong, Nicholas Carlini, Florian Tramèr
We demonstrate that this strategy provides a false sense of security, as it ignores an inherent asymmetry between the parties: users' pictures are perturbed once and for all before being published (at which point they are scraped) and must thereafter fool all future models -- including models trained adaptively against the users' past attacks, or models that use technologies discovered after the attack.
1 code implementation • NeurIPS 2021 • Mani Malek, Ilya Mironov, Karthik Prasad, Igor Shilov, Florian Tramèr
We propose two novel approaches based on, respectively, the Laplace mechanism and the PATE framework, and demonstrate their effectiveness on standard benchmarks.
1 code implementation • ICLR 2021 • Florian Tramèr, Dan Boneh
We demonstrate that differentially private machine learning has not yet reached its "AlexNet moment" on many canonical vision tasks: linear models trained on handcrafted features significantly outperform end-to-end deep neural networks for moderate privacy budgets.
1 code implementation • ICML 2020 • Florian Tramèr, Jens Behrmann, Nicholas Carlini, Nicolas Papernot, Jörn-Henrik Jacobsen
Adversarial examples are malicious inputs crafted to induce misclassification.
9 code implementations • 10 Dec 2019 • Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, Sen Zhao
FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches.
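A hedged sketch of the canonical aggregation step (federated averaging), included only to illustrate the setting; `local_train` is an assumed client-side routine and nothing here is specific to this survey:

```python
# Hedged sketch of one FedAvg round: clients train locally and the server
# averages their weights, weighted by local dataset size.
import numpy as np

def fedavg_round(global_weights, clients, local_train):
    """One FL round: each client updates locally, the server aggregates."""
    updates, sizes = [], []
    for data in clients:
        w = local_train(np.copy(global_weights), data)  # local SGD epochs
        updates.append(w)
        sizes.append(len(data))
    sizes = np.array(sizes, dtype=float)
    weights = sizes / sizes.sum()
    # Weighted average of client models; raw data never leaves the clients.
    return sum(w_i * p for w_i, p in zip(updates, weights))
```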
1 code implementation • NeurIPS 2019 • Florian Tramèr, Dan Boneh
Defenses against adversarial examples, such as adversarial training, are typically tailored to a single perturbation type (e.g., small $\ell_\infty$-noise).
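A hedged toy sketch of the "train on the worst of several perturbation types" idea, on a linear model with single FGSM-style steps; step sizes and the model are illustrative assumptions, not the paper's setup:

```python
# Hedged sketch: craft an L-inf and an L2 perturbation of the input,
# then take a training step on whichever one yields the higher loss.
import numpy as np

def logistic_loss(w, x, y):
    return np.log1p(np.exp(-y * (w @ x)))

def worst_of_two(w, x, y, eps_inf=0.03, eps_2=0.5):
    grad_x = -y * w * (1 - 1 / (1 + np.exp(-y * (w @ x))))       # dloss/dx
    x_inf = x + eps_inf * np.sign(grad_x)                         # FGSM-style L-inf step
    x_l2 = x + eps_2 * grad_x / (np.linalg.norm(grad_x) + 1e-12)  # normalized L2 step
    return max((x_inf, x_l2), key=lambda xp: logistic_loss(w, xp, y))

def adv_train_step(w, x, y, lr=0.1):
    x_adv = worst_of_two(w, x, y)
    # Gradient of the loss at the worst-case perturbation, w.r.t. the weights
    grad_w = -y * x_adv * (1 - 1 / (1 + np.exp(-y * (w @ x_adv))))
    return w - lr * grad_w
```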
no code implementations • 25 Mar 2019 • Jörn-Henrik Jacobsen, Jens Behrmann, Nicholas Carlini, Florian Tramèr, Nicolas Papernot
Excessive invariance is not limited to models trained to be robust to perturbation-based $\ell_p$-norm adversaries.
1 code implementation • 2 Dec 2018 • Edward Chou, Florian Tramèr, Giancarlo Pellegrino, Dan Boneh
By leveraging the neural network's susceptibility to attacks and by using techniques from model interpretability and object detection as detection mechanisms, SentiNet turns a weakness of a model into a strength.
1 code implementation • 8 Nov 2018 • Florian Tramèr, Pascal Dupré, Gili Rusak, Giancarlo Pellegrino, Dan Boneh
On the other, we present a concrete set of attacks on visual ad-blockers by constructing adversarial examples in a real web page context.
1 code implementation • ICLR 2019 • Florian Tramèr, Dan Boneh
As Machine Learning (ML) gets applied to security-critical or sensitive domains, there is a growing need for integrity and privacy for outsourced ML computations.
13 code implementations • ICLR 2018 • Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel
We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss.
2 code implementations • 11 Apr 2017 • Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel
Adversarial examples are maliciously perturbed inputs designed to mislead machine learning (ML) models at test-time.
1 code implementation • 9 Sep 2016 • Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, Thomas Ristenpart
In such attacks, an adversary with black-box access, but no prior knowledge of an ML model's parameters or training data, aims to duplicate the functionality of (i.e., "steal") the model.
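A minimal hedged sketch of the setting: query the victim model through its prediction interface and fit a local surrogate on the returned labels (`victim_predict` is an assumed query-only interface, and this is not the paper's equation-solving attack):

```python
# Hedged sketch of model extraction against a black-box prediction API:
# query on synthetic inputs and fit a surrogate on the returned labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def steal_model(victim_predict, n_queries=1000, dim=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_queries, dim))           # synthetic query points
    y = np.array([victim_predict(x) for x in X])    # black-box labels only
    # Assumes the queries hit at least two classes, so the surrogate can be fit.
    surrogate = LogisticRegression(max_iter=1000).fit(X, y)
    return surrogate
```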