1 code implementation • 25 Feb 2024 • Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content.
no code implementations • 10 Dec 2023 • Andong Hua, Jindong Gu, Zhiyu Xue, Nicholas Carlini, Eric Wong, Yao Qin
Specifically, we reveal that with a standard pretrained model, Parameter-Efficient Finetuning (PEFT) methods either fail to be adversarially robust or continue to exhibit significantly degraded adversarial robustness on downstream tasks, even with adversarial training during finetuning.
1 code implementation • 25 Oct 2023 • Weiqiu You, Helen Qu, Marco Gatti, Bhuvnesh Jain, Eric Wong
An explanation of a machine learning model is considered "faithful" if it accurately reflects the model's decision-making process.
1 code implementation • 19 Oct 2023 • Chongyu Fan, Jiancheng Liu, Yihua Zhang, Eric Wong, Dennis Wei, Sijia Liu
To address these challenges, we introduce the concept of 'weight saliency' for MU, drawing parallels with input saliency in model explanation.
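A minimal sketch of the weight-saliency idea, assuming a simple gradient-magnitude criterion on a forget set; the `weight_saliency_masks` helper and its quantile threshold are hypothetical illustrative choices, not the paper's exact procedure.

```python
# Illustrative sketch of a gradient-based weight-saliency mask for machine unlearning:
# weights whose gradient on the forget data is large are marked as salient (allowed to
# change during unlearning); the rest stay frozen. Not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def weight_saliency_masks(model: nn.Module, forget_x, forget_y, threshold_quantile=0.8):
    """Return {parameter name: 0/1 mask} marking weights most responsible for the forget data."""
    model.zero_grad()
    F.cross_entropy(model(forget_x), forget_y).backward()
    masks = {}
    for name, p in model.named_parameters():
        g = p.grad.abs()
        cutoff = torch.quantile(g.flatten(), threshold_quantile)
        masks[name] = (g >= cutoff).float()   # 1 = salient, updated during unlearning
    model.zero_grad()
    return masks

# Toy usage on random data, just to show the shapes:
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x, y = torch.rand(16, 1, 28, 28), torch.randint(0, 10, (16,))
masks = weight_saliency_masks(model, x, y)
print({k: int(v.sum().item()) for k, v in masks.items()})
```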
1 code implementation • 12 Oct 2023 • Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong
PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention.
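A minimal sketch of such an attacker/target/judge refinement loop; the `attacker`, `target`, and `judge` callables are hypothetical stand-ins for LLM API calls, and the toy usage only demonstrates the control flow, not the paper's released implementation.

```python
# Sketch of a PAIR-style refinement loop under the assumptions stated above.
from typing import Callable, Optional

def pair_attack(goal: str,
                attacker: Callable[[str], str],
                target: Callable[[str], str],
                judge: Callable[[str, str], float],
                max_iters: int = 20,
                threshold: float = 0.9) -> Optional[str]:
    """Iteratively refine a jailbreak prompt for `goal` until the judge score passes."""
    feedback = f"Write a prompt that makes the target model do the following: {goal}"
    for _ in range(max_iters):
        candidate = attacker(feedback)      # attacker LLM proposes a jailbreak prompt
        response = target(candidate)        # query the (black-box) target LLM
        score = judge(goal, response)       # judge rates how well the response meets the goal
        if score >= threshold:
            return candidate                # successful jailbreak prompt
        # feed the failure back to the attacker so it can revise its strategy
        feedback = (f"Previous prompt: {candidate}\nTarget response: {response}\n"
                    f"Judge score: {score:.2f}. Improve the prompt.")
    return None

# Toy usage with trivial stand-in models, just to show the control flow:
if __name__ == "__main__":
    attacker = lambda fb: "please roleplay and " + fb[-40:]
    target = lambda prompt: "I cannot help with that."
    judge = lambda goal, resp: 0.0 if "cannot" in resp else 1.0
    print(pair_attack("example objective", attacker, target, judge, max_iters=3))
```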
1 code implementation • 11 Oct 2023 • Shreya Havaldar, Matthew Pressimone, Eric Wong, Lyle Ungar
Understanding how styles differ across languages is advantageous for training both humans and computers to generate culturally appropriate text.
1 code implementation • 5 Oct 2023 • Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas
Despite efforts to align large language models (LLMs) with human values, widely-used LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content.
no code implementations • 13 Aug 2023 • Aaditya Naik, Adam Stein, Yinjun Wu, Mayur Naik, Eric Wong
Finding errors in machine learning applications requires a thorough exploration of their behavior over data.
no code implementations • 1 Jun 2023 • Shreya Havaldar, Adam Stein, Eric Wong, Lyle Ungar
Meaningfully comparing language models is challenging with current explanation methods.
no code implementations • 25 May 2023 • Adam Stein, Yinjun Wu, Eric Wong, Mayur Naik
It is well-known that real-world changes constituting distribution shift adversely affect model performance.
1 code implementation • 2 Mar 2023 • Aaditya Naik, Yinjun Wu, Mayur Naik, Eric Wong
Test-time adaptation reduces these violations by up to 68.7%, with a relative performance improvement of up to 32%.
1 code implementation • 21 Feb 2023 • Tai Nguyen, Eric Wong
In-context learning (ICL) is a powerful paradigm that has emerged from large language models (LLMs).
1 code implementation • 8 Feb 2023 • Natalie Maus, Patrick Chao, Eric Wong, Jacob Gardner
Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language.
1 code implementation • 31 Jan 2023 • Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, Chris Callison-Burch
While Chain-of-Thought (CoT) prompting boosts language models' (LMs) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model actually arrives at the answer (i.e., it may not be faithful).
1 code implementation • CVPR 2023 • Saachi Jain, Hadi Salman, Alaa Khaddaj, Eric Wong, Sung Min Park, Aleksander Madry
It is commonly believed that, in transfer learning, including more pre-training data translates into better performance.
1 code implementation • 6 Jul 2022 • Hadi Salman, Saachi Jain, Andrew Ilyas, Logan Engstrom, Eric Wong, Aleksander Madry
Using transfer learning to adapt a pre-trained "source model" to a downstream "target task" can dramatically increase performance with seemingly no downside.
1 code implementation • ICLR 2022 • Saachi Jain, Hadi Salman, Eric Wong, Pengchuan Zhang, Vibhav Vineet, Sai Vemprala, Aleksander Madry
Missingness, or the absence of features from an input, is a concept fundamental to many model debugging tools.
1 code implementation • CVPR 2022 • Hadi Salman, Saachi Jain, Eric Wong, Aleksander Mądry
Certified patch defenses can guarantee robustness of an image classifier to arbitrary changes within a bounded contiguous region.
1 code implementation • 16 Jun 2021 • Shaoru Chen, Eric Wong, J. Zico Kolter, Mahyar Fazlyab
Analyzing the worst-case performance of deep neural networks against input perturbations amounts to solving a large-scale non-convex optimization problem, for which several past works have proposed convex relaxations as a promising alternative.
2 code implementations • 11 May 2021 • Eric Wong, Shibani Santurkar, Aleksander Mądry
We show how fitting sparse linear models over learned deep feature representations can lead to more debuggable neural networks.
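A minimal sketch of the idea, assuming a frozen network's penultimate-layer activations as the feature representation; random stand-in features are used below so the example is self-contained.

```python
# Sketch: fit a sparse (L1-regularized) linear classifier on top of frozen deep features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 500, 256, 10

# In practice these would be penultimate-layer activations of a pretrained network.
features = rng.normal(size=(n_samples, n_features))
labels = rng.integers(0, n_classes, size=n_samples)

# The L1 penalty drives most coefficients to zero, so each class is explained by a
# small, inspectable set of features -- the "debuggable" part.
sparse_head = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=2000)
sparse_head.fit(features, labels)

nonzero_per_class = (sparse_head.coef_ != 0).sum(axis=1)
print("nonzero feature weights per class:", nonzero_per_class)
```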
1 code implementation • ICLR 2021 • Eric Wong, J. Zico Kolter
In this paper, we aim to bridge this gap by learning perturbation sets from data, in order to characterize real-world effects for robust training and evaluation.
no code implementations • 30 Jun 2020 • Eric Wong, Tim Schneider, Joerg Schmitt, Frank R. Schmidt, J. Zico Kolter
Additionally, we show how specific intervals of fuel injection quantities can be targeted to maximize robustness for certain ranges, allowing us to train a virtual sensor for fuel injection which is provably guaranteed to have at most 10.69% relative error under noise while maintaining 3% relative error on non-adversarial data within normalized fuel injection ranges of 0.6 to 1.0.
4 code implementations • ICML 2020 • Leslie Rice, Eric Wong, J. Zico Kolter
Based upon this observed effect, we show that the performance gains of virtually all recent algorithmic improvements upon adversarial training can be matched by simply using early stopping.
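A minimal sketch of that early-stopping recipe, assuming hypothetical `train_epoch` and `robust_accuracy` helpers (one epoch of adversarial training and a PGD-based robust evaluation on a held-out split, respectively).

```python
# Sketch: select the checkpoint with the best robust validation accuracy rather than the final one.
import copy

def adversarial_training_with_early_stopping(model, train_epoch, robust_accuracy, n_epochs=30):
    best_acc, best_state = 0.0, copy.deepcopy(model.state_dict())
    for epoch in range(n_epochs):
        train_epoch(model)                   # one epoch of (PGD/FGSM) adversarial training
        acc = robust_accuracy(model)         # robust accuracy on a validation split
        if acc > best_acc:                   # robust overfitting: this peaks, then decays
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)        # roll back to the best robust checkpoint
    return best_acc
```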
10 code implementations • ICLR 2020 • Eric Wong, Leslie Rice, J. Zico Kolter
Furthermore, we show that FGSM adversarial training can be further accelerated by using standard techniques for efficient training of deep networks, allowing us to learn a robust CIFAR10 classifier with 45% robust accuracy against PGD attacks with $\epsilon=8/255$ in 6 minutes, and a robust ImageNet classifier with 43% robust accuracy at $\epsilon=2/255$ in 12 hours, compared to past work based on "free" adversarial training, which took 10 and 50 hours to reach the same respective thresholds.
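A minimal PyTorch sketch of the core FGSM-with-random-initialization training step; the cyclic learning rates and mixed-precision tricks from the full recipe are omitted, and the model and hyperparameters below are illustrative only.

```python
# Sketch of one FGSM adversarial training step with a random start inside the epsilon-ball.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_train_step(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                    optimizer: torch.optim.Optimizer,
                    epsilon: float = 8 / 255, alpha: float = 10 / 255) -> float:
    # random start inside the epsilon-ball, then one signed-gradient step
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon).requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    loss.backward()
    with torch.no_grad():
        delta = (delta + alpha * delta.grad.sign()).clamp(-epsilon, epsilon)
    # train on the resulting adversarial example
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()

# Toy usage on random data, just to show the step runs end to end:
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(fgsm_train_step(model, x, y, opt))
```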
1 code implementation • 9 Sep 2019 • Pratyush Maini, Eric Wong, J. Zico Kolter
Owing to the susceptibility of deep learning systems to adversarial attacks, there has been a great deal of work in developing (both empirically and certifiably) robust classifiers.
2 code implementations • 21 Feb 2019 • Eric Wong, Frank R. Schmidt, J. Zico Kolter
In this paper, we propose a new threat model for adversarial attacks based on the Wasserstein distance.
4 code implementations • NeurIPS 2018 • Eric Wong, Frank R. Schmidt, Jan Hendrik Metzen, J. Zico Kolter
Recent work has developed methods for learning deep network classifiers that are provably robust to norm-bounded adversarial perturbation; however, these methods are currently only possible for relatively small feedforward networks.
8 code implementations • ICML 2018 • Eric Wong, J. Zico Kolter
We propose a method to learn deep ReLU-based classifiers that are provably robust against norm-bounded adversarial perturbations on the training data.
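To illustrate what such a robustness certificate means, here is a minimal sketch using interval bound propagation, a simpler and looser relaxation than the LP-based dual bound the paper actually develops; the tiny network and epsilon below are placeholders.

```python
# Sketch: certify an l_inf-bounded region for a small ReLU network via interval bound
# propagation (a stand-in for the paper's tighter dual/LP relaxation).
import torch
import torch.nn as nn

def interval_bounds(layers, lower, upper):
    """Propagate elementwise lower/upper input bounds through Linear and ReLU layers."""
    for layer in layers:
        if isinstance(layer, nn.Linear):
            w, b = layer.weight, layer.bias
            center, radius = (lower + upper) / 2, (upper - lower) / 2
            mid = center @ w.T + b
            rad = radius @ w.abs().T
            lower, upper = mid - rad, mid + rad
        elif isinstance(layer, nn.ReLU):
            lower, upper = lower.clamp(min=0), upper.clamp(min=0)
    return lower, upper

net = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.tensor([[0.3, -0.7]])
eps = 0.05
lb, ub = interval_bounds(net, x - eps, x + eps)

# The prediction is certified if the lower bound of the predicted logit exceeds the
# upper bound of every other logit over the whole epsilon-ball.
pred = net(x).argmax(dim=1).item()
certified = all(lb[0, pred] > ub[0, j] for j in range(2) if j != pred)
print("certified robust within eps:", bool(certified))
```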