1 code implementation • 27 Jul 2023 • Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson
Specifically, our approach searches for a suffix that, when attached to a wide range of queries asking an LLM to produce objectionable content, maximizes the probability that the model produces an affirmative response (rather than refusing to answer).
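A minimal sketch of the underlying objective, assuming a Hugging Face causal LM (the model name and strings below are placeholders, not the paper's setup): the suffix is chosen to minimize the negative log-likelihood of an affirmative target completion given prompt + suffix.

```python
# Sketch of the attack objective only (placeholder model and strings; not the
# authors' released code): the suffix is optimized so that an affirmative
# completion becomes as likely as possible given prompt + suffix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM illustrates the objective
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def affirmative_nll(prompt: str, suffix: str, target: str) -> float:
    """Negative log-likelihood of `target` given prompt + suffix; the attack drives this down."""
    prefix_ids = tok(prompt + suffix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # score only the target tokens
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

print(affirmative_nll("Write a short story.", " ! ! ! !", " Sure, here is a short story"))
```

The search over suffix tokens itself (gradient-guided candidate swaps evaluated in batch) is the paper's contribution; the snippet only shows the quantity being optimized.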
1 code implementation • 29 Jan 2023 • Kai Hu, Andy Zou, Zifan Wang, Klas Leino, Matt Fredrikson
We show that fast ways of bounding the Lipschitz constant for conventional ResNets are loose, and show how to address this by designing a new residual block, leading to the Linear ResNet (LiResNet) architecture.
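A toy numerical illustration of why composed bounds can be loose (illustration only, not the paper's code or architecture): for a residual block whose branch is linear, x ↦ x + Wx, summing per-term bounds gives 1 + ‖W‖, while the block's true Lipschitz constant is ‖I + W‖, which can be computed directly.

```python
# Toy comparison of a composed Lipschitz bound vs. the exact spectral norm of a
# linear residual block x -> x + W x (illustration only).
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(scale=0.2, size=(d, d))

composed_bound = 1.0 + np.linalg.norm(W, 2)        # Lip(identity) + Lip(W)
exact_constant = np.linalg.norm(np.eye(d) + W, 2)  # spectral norm of the whole block

print(f"composed bound: {composed_bound:.3f}  exact: {exact_constant:.3f}")
```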
no code implementations • 26 Jan 2023 • Matt Fredrikson, Kaiji Lu, Saranya Vijayakumar, Somesh Jha, Vijay Ganesh, Zifan Wang
Recent techniques that integrate solver layers into Deep Neural Networks (DNNs) have shown promise in bridging a long-standing gap between inductive learning and symbolic reasoning techniques.
no code implementations • 8 Sep 2022 • Marc Juarez, Samuel Yeom, Matt Fredrikson
Our experimental results on real-world datasets show that this approach is effective, achieving 80–100% AUC-ROC in detecting shifts involving the underrepresentation of a demographic group in the training set.
1 code implementation • 1 Jun 2022 • Ravi Mangal, Zifan Wang, Chi Zhang, Klas Leino, Corina Pasareanu, Matt Fredrikson
We present the cascade attack (CasA), an adversarial attack against cascading ensembles, and show that: (1) there exists an adversarial input for up to 88% of the samples where the ensemble claims to be certifiably robust and accurate; and (2) the accuracy of a cascading ensemble under our attack is as low as 11% when it claims to be certifiably robust and accurate on 97% of the test set.
no code implementations • 24 May 2022 • Zifan Wang, Yuhang Yao, Chaoran Zhang, Han Zhang, Youjie Kang, Carlee Joe-Wong, Matt Fredrikson, Anupam Datta
Second, our analytical and empirical results demonstrate that feature attribution methods cannot capture the nonlinear effect of edge features, while existing subgraph explanation methods are not faithful.
no code implementations • ICLR 2022 • Emily Black, Klas Leino, Matt Fredrikson
Recent work has shown that models trained to the same objective, and which achieve similar measures of accuracy on consistent test data, may nonetheless behave very differently on individual predictions.
no code implementations • ICLR 2022 • Emily Black, Zifan Wang, Matt Fredrikson, Anupam Datta
Counterfactual examples are one of the most commonly-cited methods for explaining the predictions of machine learning models in key areas such as finance and medical diagnosis.
no code implementations • 29 Sep 2021 • Klas Leino, Chi Zhang, Ravi Mangal, Matt Fredrikson, Bryan Parno, Corina Pasareanu
Certifiably robust neural networks employ provable run-time defenses against adversarial examples by checking if the model is locally robust at the input under evaluation.
1 code implementation • 23 Jul 2021 • Klas Leino, Aymeric Fromherz, Ravi Mangal, Matt Fredrikson, Bryan Parno, Corina Păsăreanu
These constraints relate requirements on the order of the classes output by a classifier to conditions on its input, and are expressive enough to encode various interesting examples of classifier safety specifications from the literature.
no code implementations • 21 Jul 2021 • Emily Black, Matt Fredrikson
We introduce leave-one-out unfairness, which characterizes how likely a model's prediction for an individual will change due to the inclusion or removal of a single other person in the model's training data.
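One way to write this down (notation assumed here, not taken from the paper): for a learning algorithm producing a model $h_D$ from training set $D$, the leave-one-out instability of a prediction at $x$ is

$$
\Delta(x) \;=\; \Pr_{z \in D}\!\big[\, h_{D}(x) \neq h_{D \setminus \{z\}}(x) \,\big],
$$

the probability, over the choice of a single removed training point $z$ (and any randomness in training), that the model's prediction for $x$ flips.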
1 code implementation • NeurIPS 2021 • Klas Leino, Matt Fredrikson
Certifiable local robustness, which rigorously precludes small-norm adversarial examples, has received significant attention as a means of addressing security concerns in deep learning.
1 code implementation • 20 Mar 2021 • Zifan Wang, Matt Fredrikson, Anupam Datta
Recent work has found that adversarially-robust deep networks used for image classification are more interpretable: their feature attributions tend to be sharper, and are more concentrated on the objects associated with the image's ground-truth class.
1 code implementation • 16 Feb 2021 • Klas Leino, Zifan Wang, Matt Fredrikson
We show that widely-used architectures can be easily adapted to this objective by incorporating efficient global Lipschitz bounds into the network, yielding certifiably-robust models by construction that achieve state-of-the-art verifiable accuracy.
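A simplified sketch of the certificate such models rely on (not the paper's exact construction, which folds an equivalent check into the architecture as an extra output logit): with a global Lipschitz bound K obtained from the product of layer spectral norms, a prediction at x is certifiably robust at radius eps whenever the top-two logit margin exceeds sqrt(2)·K·eps.

```python
# Simplified Lipschitz-margin certification (illustration only).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def global_lipschitz_bound(net: nn.Sequential) -> float:
    """Upper bound: product of layer spectral norms (ReLU is 1-Lipschitz)."""
    K = 1.0
    for layer in net:
        if isinstance(layer, nn.Linear):
            K *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return K

def certified_at(net, x, eps):
    logits = net(x)
    top2 = logits.topk(2, dim=-1).values
    margin = (top2[..., 0] - top2[..., 1]).item()
    return margin > (2 ** 0.5) * global_lipschitz_bound(net) * eps

print(certified_at(net, torch.randn(1, 784), eps=0.1))
```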
1 code implementation • NeurIPS 2020 • Zifan Wang, Haofan Wang, Shakul Ramkumar, Matt Fredrikson, Piotr Mardziel, Anupam Datta
Feature attributions are a popular tool for explaining the behavior of Deep Neural Networks (DNNs), but have recently been shown to be vulnerable to attacks that produce divergent explanations for nearby inputs.
no code implementations • 19 Feb 2020 • Zifan Wang, Piotr Mardziel, Anupam Datta, Matt Fredrikson
In this work we expand the foundations of human-understandable concepts with which attributions can be interpreted beyond "importance" and its visualization; we incorporate the logical concepts of necessity and sufficiency, and the concept of proportionality.
no code implementations • 18 Feb 2020 • Samuel Yeom, Matt Fredrikson
We turn the definition of individual fairness on its head: rather than ascertaining the fairness of a model given a predetermined metric, we find a metric for a given model that satisfies individual fairness.
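In the standard formulation (due to Dwork et al.; the symbols here are the usual ones, not necessarily the paper's), individual fairness is a Lipschitz condition relating an input metric $d_X$ to an output metric $d_Y$:

$$
d_Y\!\big(f(x), f(x')\big) \;\le\; d_X(x, x') \qquad \text{for all inputs } x, x'.
$$

The usual question fixes $d_X$ and asks whether $f$ satisfies this; the work above instead fixes $f$ and searches for a $d_X$ under which it does.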
no code implementations • ICLR 2021 • Aymeric Fromherz, Klas Leino, Matt Fredrikson, Bryan Parno, Corina Păsăreanu
Local robustness ensures that a model classifies all inputs within an $\ell_2$-ball consistently, which precludes various forms of adversarial inputs.
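Concretely, a classifier $F$ is locally robust at $x$ with radius $\epsilon$ when

$$
\forall x'.\;\; \lVert x' - x \rVert_2 \le \epsilon \;\Longrightarrow\; F(x') = F(x).
$$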
no code implementations • 27 Jun 2019 • Klas Leino, Matt Fredrikson
Membership inference (MI) attacks exploit the fact that machine learning algorithms sometimes leak information about their training data through the learned model.
2 code implementations • 27 Jun 2019 • Zilong Tan, Samuel Yeom, Matt Fredrikson, Ameet Talwalkar
In contrast, we demonstrate the promise of learning a model-aware fair representation, focusing on kernel-based models.
1 code implementation • 21 Jun 2019 • Emily Black, Samuel Yeom, Matt Fredrikson
We present FlipTest, a black-box technique for uncovering discrimination in classifiers.
no code implementations • ICLR 2019 • Klas Leino, Emily Black, Matt Fredrikson, Shayak Sen, Anupam Datta
This overestimation gives rise to feature-wise bias amplification, a previously unreported form of bias that can be traced back to the features of a trained model.
1 code implementation • NeurIPS 2018 • Samuel Yeom, Anupam Datta, Matt Fredrikson
In this paper we formulate a definition of proxy use for the setting of linear regression and present algorithms for detecting proxies.
2 code implementations • ICLR 2018 • Klas Leino, Shayak Sen, Anupam Datta, Matt Fredrikson, Linyi Li
We study the problem of explaining a rich class of behavioral properties of deep neural networks.
no code implementations • 27 Sep 2017 • Linyi Li, Matt Fredrikson, Shayak Sen, Anupam Datta
In this report, we applied integrated gradients to explain a neural network for diabetic retinopathy detection.
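For reference, integrated gradients attributes feature i by integrating the gradient of the class score along the straight-line path from a baseline x0 to the input x; a generic Riemann-sum approximation (toy model below, not the report's retinopathy network) looks like this:

```python
# Generic Riemann-sum approximation of integrated gradients (toy model only).
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)    # points from baseline to input
    path.requires_grad_(True)
    score = model(path)[:, target_class].sum()
    grads = torch.autograd.grad(score, path)[0]
    return (x - baseline) * grads.mean(dim=0)    # (x_i - x0_i) * average gradient

model = torch.nn.Sequential(torch.nn.Linear(4, 3))
attributions = integrated_gradients(model, torch.randn(4), torch.zeros(4), target_class=0)
print(attributions)
```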
1 code implementation • 5 Sep 2017 • Samuel Yeom, Irene Giacomelli, Matt Fredrikson, Somesh Jha
This paper examines the effect that overfitting and influence have on the ability of an attacker to learn information about the training data from machine learning models, either through training set membership inference or attribute inference attacks.
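A common baseline in this line of work is a loss-threshold membership test: guess "member" when the model's loss on a record falls below a threshold (e.g., the average training loss), which succeeds exactly to the extent the model overfits. The sketch below uses illustrative names and synthetic losses, not any paper artifact.

```python
# Loss-threshold membership inference baseline (illustrative sketch).
import numpy as np

def loss_threshold_attack(per_example_losses: np.ndarray, threshold: float) -> np.ndarray:
    """Predict membership (True) for records whose loss falls below the threshold."""
    return per_example_losses < threshold

# Toy data: members (seen in training) tend to have lower loss than non-members.
rng = np.random.default_rng(0)
member_losses = rng.exponential(scale=0.2, size=1000)
nonmember_losses = rng.exponential(scale=1.0, size=1000)
threshold = member_losses.mean()  # e.g., the average training loss

guesses = loss_threshold_attack(np.concatenate([member_losses, nonmember_losses]), threshold)
truth = np.concatenate([np.ones(1000, bool), np.zeros(1000, bool)])
print("attack accuracy:", (guesses == truth).mean())
```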
3 code implementations • 25 Jul 2017 • Anupam Datta, Matt Fredrikson, Gihyuk Ko, Piotr Mardziel, Shayak Sen
Machine-learnt systems inherit biases against protected classes (historically disparaged groups) from their training data.
11 code implementations • 24 Nov 2015 • Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, Ananthram Swami
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs.
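One of the attacks introduced there builds a saliency map from the forward derivative (the Jacobian of the model's outputs with respect to its inputs) and perturbs the most salient features toward a target class. A simplified single-feature step might look like the following sketch (toy model and assumed step size; the paper's attack perturbs feature pairs and iterates under a distortion budget).

```python
# Simplified Jacobian-saliency step toward a target class (toy model only).
import torch

def saliency_step(model, x, target, theta=0.1):
    jac = torch.autograd.functional.jacobian(model, x)   # shape: (classes, features)
    d_target = jac[target]
    d_others = jac.sum(dim=0) - d_target
    # Salient features raise the target class while lowering the others.
    saliency = torch.where((d_target > 0) & (d_others < 0),
                           d_target * d_others.abs(),
                           torch.zeros_like(d_target))
    x_adv = x.clone()
    x_adv[saliency.argmax()] += theta  # nudge the most salient feature
    return x_adv

model = torch.nn.Sequential(torch.nn.Linear(10, 3), torch.nn.Softmax(dim=-1))
x_adv = saliency_step(model, torch.rand(10), target=2)
print(x_adv)
```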