Search Results for author: Stanislav Fort

Found 26 papers, 14 papers with code

Scaling Laws for Adversarial Attacks on Language Model Activations

no code implementations • 5 Dec 2023 • Stanislav Fort

We find that the number of bits of control in the input space needed to control a single bit in the output space (what we call attack resistance $\chi$) is remarkably stable, between $\approx 16$ and $\approx 25$, across 2 orders of magnitude of model size for different language models.

Language Modelling
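As a rough formalization of the quantity described above (the ratio notation is introduced here for clarity, not taken verbatim from the paper):

$$\chi \;\equiv\; \frac{\#\,\text{bits of attacker control in the input}}{\#\,\text{bits determined in the output}}, \qquad 16 \lesssim \chi \lesssim 25$$

with the empirical range holding across roughly two orders of magnitude of model size.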

Multi-attacks: Many images $+$ the same adversarial attack $\to$ many target labels

1 code implementation • 4 Aug 2023 • Stanislav Fort

We show that we can easily design a single adversarial perturbation $P$ that changes the class of $n$ images $X_1, X_2,\dots, X_n$ from their original, unperturbed classes $c_1, c_2,\dots, c_n$ to desired (not necessarily all the same) classes $c^*_1, c^*_2,\dots, c^*_n$ for up to hundreds of images and target classes at once.

Adversarial Attack
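A minimal sketch of the mechanism in the abstract: optimize one shared perturbation $P$ by gradient descent so that each image $X_i + P$ is classified as its own target $c^*_i$. This is an illustrative PyTorch reconstruction under assumed hyperparameters, not the paper's reference implementation (which is linked from the entry above).

import torch
import torch.nn.functional as F

def multi_attack(model, images, targets, steps=500, lr=1e-2, eps=8/255):
    # One perturbation P shared by every image; the targets may all differ.
    P = torch.zeros_like(images[0], requires_grad=True)
    opt = torch.optim.Adam([P], lr=lr)
    for _ in range(steps):
        logits = model((images + P).clamp(0, 1))   # P broadcasts over the batch
        loss = F.cross_entropy(logits, targets)    # push image i toward target i
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            P.clamp_(-eps, eps)                    # optional size constraint (assumed)
    return P.detach()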

Adversarial vulnerability of powerful near out-of-distribution detection

1 code implementation • 18 Jan 2022 • Stanislav Fort

There has been significant progress recently in detecting out-of-distribution (OOD) inputs in neural networks, primarily due to the use of large models pretrained on large datasets, and an emerging use of multi-modality.

Adversarial Robustness • Out-of-Distribution Detection • +1

How many degrees of freedom do we need to train deep networks: a loss landscape perspective

1 code implementation • ICLR 2022 • Brett W. Larsen, Stanislav Fort, Nic Becker, Surya Ganguli

In particular, we show, via Gordon's escape theorem, that the training dimension plus the Gaussian width of the desired loss sub-level set, projected onto a unit sphere surrounding the initialization, must exceed the total number of parameters for the success probability to be large.
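For reference, the condition has the shape of the standard escape-through-a-mesh bound: with $D$ total parameters, training dimension $d$, and $\tilde{S}$ the projected sub-level set on the unit sphere, high success probability requires roughly

$$d + w(\tilde{S})^2 \;\gtrsim\; D,$$

where $w(\cdot)$ is the Gaussian width; the squared width follows the usual statement of Gordon's theorem and is a reading of the abstract, not a verbatim formula from the paper.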

A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection

3 code implementations • 16 Jun 2021 • Jie Ren, Stanislav Fort, Jeremiah Liu, Abhijit Guha Roy, Shreyas Padhy, Balaji Lakshminarayanan

Mahalanobis distance (MD) is a simple and popular post-processing method for detecting out-of-distribution (OOD) inputs in neural networks.

Intent Detection • Out-of-Distribution Detection • +1
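For context, the "simple fix" the paper proposes is the relative Mahalanobis distance: subtract from each class-conditional Mahalanobis distance the distance under a single class-agnostic background Gaussian. A hedged NumPy sketch; the shared-covariance fitting and tensor shapes follow common practice and are assumptions here, not the authors' code.

import numpy as np

def fit_gaussians(feats, labels):
    # Per-class means with a shared covariance, plus one background Gaussian.
    classes = np.unique(labels)
    mus = np.stack([feats[labels == c].mean(0) for c in classes])
    centered = feats - mus[np.searchsorted(classes, labels)]
    shared_icov = np.linalg.pinv(centered.T @ centered / len(feats))
    mu0, icov0 = feats.mean(0), np.linalg.pinv(np.cov(feats, rowvar=False))
    return mus, shared_icov, mu0, icov0

def ood_scores(z, mus, icov, mu0, icov0):
    # Returns (MD score, relative-MD score); higher means more in-distribution.
    d = z - mus                                # (num_classes, feat_dim)
    md = np.einsum('cd,de,ce->c', d, icov, d)  # class-conditional Mahalanobis
    md0 = (z - mu0) @ icov0 @ (z - mu0)        # background (class-agnostic) term
    return -md.min(), -(md - md0).min()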

Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error

no code implementations • 27 May 2021 • Stanislav Fort, Andrew Brock, Razvan Pascanu, Soham De, Samuel L. Smith

In this work, we provide a detailed empirical evaluation of how the number of augmentation samples per unique image influences model performance on held-out data when training deep ResNets.

Data Augmentation • Image Classification
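The training-loop change being evaluated is simple to state in code: draw several independent augmentations of each unique image and average their gradients. A PyTorch-flavored sketch with placeholder names (augment is any stochastic transform), not the authors' pipeline.

import torch

def multi_augment_batch(images, labels, augment, num_samples=4):
    # Each unique image contributes num_samples independent augmented copies;
    # the batch loss then averages over copies, lowering gradient variance
    # per unique image at the cost of a larger effective batch.
    copies = torch.cat(
        [torch.stack([augment(x) for _ in range(num_samples)]) for x in images]
    )
    return copies, labels.repeat_interleave(num_samples)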

Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes

1 code implementation • 22 Apr 2021 • James Lucas, Juhan Bae, Michael R. Zhang, Stanislav Fort, Richard Zemel, Roger Grosse

Linear interpolation between initial neural network parameters and converged parameters after training with stochastic gradient descent (SGD) typically leads to a monotonic decrease in the training objective.
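This monotonic linear interpolation (MLI) property is easy to probe: evaluate the training loss along the straight line $\theta(\alpha) = (1-\alpha)\,\theta_0 + \alpha\,\theta_T$ between the initial and converged parameters. A hedged PyTorch sketch; the model, loader, and single-batch evaluation are placeholders, not the paper's protocol.

import copy
import torch

def loss_along_line(model, theta0, thetaT, loss_fn, loader, num_alphas=21):
    # theta0 / thetaT: lists of parameter tensors at init and at convergence.
    probe = copy.deepcopy(model)
    losses = []
    x, y = next(iter(loader))                  # single probe batch, for speed
    for alpha in torch.linspace(0, 1, num_alphas):
        with torch.no_grad():
            for p, p0, pT in zip(probe.parameters(), theta0, thetaT):
                p.copy_((1 - alpha) * p0 + alpha * pT)
            losses.append(loss_fn(probe(x), y).item())
    return losses                              # MLI: typically decreases monotonically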

Slice, Dice, and Optimize: Measuring the Dimension of Neural Network Class Manifolds

no code implementations • 1 Jan 2021 • Stanislav Fort, Ekin Dogus Cubuk, Surya Ganguli, Samuel Stern Schoenholz

Deep neural network classifiers naturally partition input space into regions belonging to different classes.

Training independent subnetworks for robust prediction

2 code implementations • ICLR 2021 • Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew M. Dai, Dustin Tran

Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network.
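The approach in this paper (MIMO) realizes this by training $M$ independent subnetworks inside a single model: concatenate $M$ inputs channel-wise and predict $M$ outputs with one shared backbone. A minimal sketch; the head layout and channel-wise concatenation are assumptions consistent with the multi-input multi-output setup, not a verbatim reimplementation.

import torch
import torch.nn as nn

class MIMOClassifier(nn.Module):
    # One backbone whose first layer accepts m * C input channels,
    # plus m classification heads folded into a single linear layer.
    def __init__(self, backbone, feat_dim, num_classes, m=3):
        super().__init__()
        self.backbone, self.m = backbone, m
        self.head = nn.Linear(feat_dim, num_classes * m)

    def forward(self, xs):                        # xs: (batch, m, C, H, W)
        z = self.backbone(xs.flatten(1, 2))       # concatenate the m inputs
        return self.head(z).view(xs.shape[0], self.m, -1)

# Training: feed m independent examples and apply the loss head-by-head.
# Test: repeat one input m times and average the m softmax predictions.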

The Break-Even Point on Optimization Trajectories of Deep Neural Networks

no code implementations • ICLR 2020 • Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, Krzysztof Geras

We argue for the existence of the "break-even" point on this trajectory, beyond which the curvature of the loss surface and noise in the gradient are implicitly regularized by SGD.

Deep Ensembles: A Loss Landscape Perspective

1 code implementation • 5 Dec 2019 • Stanislav Fort, Huiyi Hu, Balaji Lakshminarayanan

One possible explanation for this gap between theory and practice is that popular scalable variational Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space.

Emergent properties of the local geometry of neural loss landscapes

no code implementations • 14 Oct 2019 • Stanislav Fort, Surya Ganguli

The local geometry of high dimensional neural network loss landscapes can both challenge our cherished theoretical intuitions and dramatically impact the practical success of neural network training.

Large Scale Structure of Neural Network Loss Landscapes

1 code implementation • NeurIPS 2019 • Stanislav Fort, Stanislaw Jastrzebski

There are many surprising and perhaps counter-intuitive properties of optimization of deep neural networks.

Stiffness: A New Perspective on Generalization in Neural Networks

no code implementations • 28 Jan 2019 • Stanislav Fort, Paweł Krzysztof Nowak, Stanislaw Jastrzebski, Srini Narayanan

In particular, we study how stiffness depends on 1) class membership, 2) distance between data points in the input space, 3) training iteration, and 4) learning rate.
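Stiffness here quantifies whether a gradient step on one example also reduces the loss on another; a standard formalization is the sign or cosine of the inner product of the two per-example loss gradients. A hedged PyTorch sketch of the cosine variant (the exact definition used in the paper may differ in detail):

import torch

def stiffness(model, loss_fn, x1, y1, x2, y2):
    # Cosine similarity of per-example gradients; positive stiffness means
    # a step that helps (x1, y1) also helps (x2, y2).
    def grad(x, y):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        return torch.cat([p.grad.flatten() for p in model.parameters()])
    g1, g2 = grad(x1, y1), grad(x2, y2)
    return torch.dot(g1, g2) / (g1.norm() * g2.norm() + 1e-12)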

Adaptive Quantum State Tomography with Neural Networks

no code implementations • 17 Dec 2018 • Yihui Quek, Stanislav Fort, Hui Khoon Ng

We demonstrate that our algorithm learns to work with basis, symmetric informationally complete (SIC), and other types of POVMs.

Quantum State Tomography

The Goldilocks zone: Towards better understanding of neural network loss landscapes

no code implementations • 6 Jul 2018 • Stanislav Fort, Adam Scherlis

We observe this effect for fully-connected neural networks over a range of network widths and depths on MNIST and CIFAR-10 datasets with the $\mathrm{ReLU}$ and $\tanh$ non-linearities, and a similar effect for convolutional networks.

Towards understanding feedback from supermassive black holes using convolutional neural networks

no code implementations • 2 Dec 2017 • Stanislav Fort

Supermassive black holes at the centers of clusters of galaxies strongly interact with their host environment via AGN feedback.

Gaussian Prototypical Networks for Few-Shot Learning on Omniglot

1 code implementation • ICLR 2018 • Stanislav Fort

We show that Gaussian prototypical networks are a preferred architecture over vanilla prototypical networks with an equivalent number of parameters.

Classification • Clustering • +2
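The difference from vanilla prototypical networks: the encoder also predicts a per-embedding confidence (an inverse variance), and prototypes become confidence-weighted means, with query distances weighted the same way. A minimal sketch under those assumptions; tensor shapes and the diagonal-covariance choice are illustrative, not the paper's exact parameterization.

import torch

def gaussian_prototypes(embeddings, inv_vars, labels, num_classes):
    # embeddings: (n, d); inv_vars: (n, d) predicted inverse variances.
    protos, proto_ivars = [], []
    for c in range(num_classes):
        m = labels == c
        s = inv_vars[m].sum(0)                             # pooled confidence
        protos.append((inv_vars[m] * embeddings[m]).sum(0) / s)
        proto_ivars.append(s)
    return torch.stack(protos), torch.stack(proto_ivars)

def gaussian_distance(query, protos, proto_ivars):
    # Variance-weighted squared distance to each prototype; lower = closer.
    return ((query - protos) ** 2 * proto_ivars).sum(-1)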
