no code implementations • 8 Aug 2024 • Stanislav Fort, Balaji Lakshminarayanan
We propose a novel, easy-to-use approach to achieving high-quality representations that lead to adversarial robustness, through the use of multi-resolution inputs and dynamic self-ensembling of intermediate-layer predictions.
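A minimal PyTorch sketch of the two ingredients named above; the specific resolutions, the channel-wise stacking, and the choice of intermediate layers are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def multi_resolution_stack(x, resolutions=(32, 16, 8)):
    """Downsample an image batch to several resolutions, upsample back,
    and concatenate the views along the channel dimension (x: [B, 3, H, W])."""
    H, W = x.shape[-2:]
    views = [F.interpolate(F.interpolate(x, size=(r, r), mode="bilinear"),
                           size=(H, W), mode="bilinear") for r in resolutions]
    return torch.cat(views, dim=1)  # [B, 3 * len(resolutions), H, W]

def self_ensemble(logits_per_layer):
    """Average class predictions read off several intermediate layers."""
    probs = [F.softmax(l, dim=-1) for l in logits_per_layer]
    return torch.stack(probs).mean(dim=0)
```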
no code implementations • 5 Dec 2023 • Stanislav Fort
We find that the number of bits of control in the input space needed to control a single bit in the output space (what we call attack resistance $\chi$) is remarkably constant between $\approx 16$ and $\approx 25$ over 2 orders of magnitude of model sizes for different language models.
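The definition of $\chi$ lends itself to a back-of-the-envelope computation. The sketch below assumes an attack controlling `n_attack_tokens` free tokens to flip `n_output_bits` bits of the output; those names are hypothetical.

```python
import math

def attack_resistance(n_attack_tokens, vocab_size, n_output_bits):
    """chi = bits of control in the input needed per controlled output bit."""
    input_bits = n_attack_tokens * math.log2(vocab_size)
    return input_bits / n_output_bits  # the paper finds chi roughly 16-25

# e.g. two free tokens from a ~50k vocabulary flipping two output bits
print(attack_resistance(n_attack_tokens=2, vocab_size=50257, n_output_bits=2))
```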
1 code implementation • 4 Aug 2023 • Stanislav Fort
We show that we can easily design a single adversarial perturbation $P$ that changes the class of $n$ images $X_1, X_2,\dots, X_n$ from their original, unperturbed classes $c_1, c_2,\dots, c_n$ to desired (not necessarily all the same) classes $c^*_1, c^*_2,\dots, c^*_n$ for up to hundreds of images and target classes at once.
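A hedged PyTorch sketch of what optimizing such a shared perturbation could look like; the $\ell_\infty$ budget, optimizer, and step count are assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def shared_perturbation(model, images, targets, eps=8 / 255, steps=200, lr=1e-2):
    """Fit one perturbation P (broadcast over all n images) that pushes
    image i toward its desired target class targets[i]."""
    P = torch.zeros_like(images[:1], requires_grad=True)
    opt = torch.optim.Adam([P], lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(model((images + P).clamp(0, 1)), targets)
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            P.clamp_(-eps, eps)  # keep P within a small l-inf budget
    return P.detach()
```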
2 code implementations • 15 Dec 2022 • Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan
In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences.
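A sketch of that data-collection step; `generate` and `prefers_a_over_b` are hypothetical stand-ins for the finetuned model being sampled and the feedback model judging which sample is better.

```python
def collect_ai_preferences(prompts, generate, prefers_a_over_b):
    dataset = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)  # two samples per prompt
        label = prefers_a_over_b(prompt, a, b)     # AI feedback, not human labels
        dataset.append((prompt, a, b, label))      # preference model trains on this
    return dataset
```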
no code implementations • 4 Nov 2022 • Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Jared Kaplan
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand.
1 code implementation • 11 Oct 2022 • Stanislav Fort, Ekin Dogus Cubuk, Surya Ganguli, Samuel S. Schoenholz
Deep neural network classifiers partition input space into high confidence regions for each class.
2 code implementations • 23 Aug 2022 • Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark
We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs.
no code implementations • 11 Jul 2022 • Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, Jared Kaplan
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly.
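One common way to operationalize this is to ask the model for the probability it assigns to its own answer being true. A minimal sketch, assuming a generic language-model API `next_token_probs` (hypothetical) that returns next-token probabilities:

```python
def p_true(model, question, proposed_answer):
    prompt = (f"Question: {question}\n"
              f"Proposed answer: {proposed_answer}\n"
              f"Is the proposed answer correct? (True/False): ")
    probs = model.next_token_probs(prompt)  # hypothetical API
    return probs["True"] / (probs["True"] + probs["False"])
```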
3 code implementations • 12 Apr 2022 • Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants.
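The preference-modeling half is often a Bradley-Terry style objective on scalar rewards; a minimal sketch, assuming a reward model that maps a sample to a scalar:

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    r_chosen = reward_model(chosen)        # scalar reward per sample
    r_rejected = reward_model(rejected)
    # maximize log sigmoid(r_chosen - r_rejected) over human comparisons
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```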
1 code implementation • 18 Jan 2022 • Stanislav Fort
There has been significant progress recently in detecting out-of-distribution (OOD) inputs to neural networks, primarily due to the use of large models pretrained on large datasets and the emerging use of multi-modality.
1 code implementation • ICLR 2022 • Brett W. Larsen, Stanislav Fort, Nic Becker, Surya Ganguli
In particular, we show, via Gordon's escape theorem, that the training dimension plus the Gaussian width of the desired loss sub-level set, projected onto a unit sphere surrounding the initialization, must exceed the total number of parameters for the success probability to be large.
4 code implementations • 16 Jun 2021 • Jie Ren, Stanislav Fort, Jeremiah Liu, Abhijit Guha Roy, Shreyas Padhy, Balaji Lakshminarayanan
Mahalanobis distance (MD) is a simple and popular post-processing method for detecting out-of-distribution (OOD) inputs in neural networks.
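For reference, the standard MD recipe fits per-class means and a shared covariance on in-distribution features, then scores a point by its minimum distance to any class mean; a sketch in NumPy:

```python
import numpy as np

def fit_mahalanobis(features, labels):
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.concatenate([features[labels == c] - means[c] for c in classes])
    cov_inv = np.linalg.pinv(np.cov(centered, rowvar=False))
    return means, cov_inv

def md_score(x, means, cov_inv):
    """Higher score = farther from every class mean = more likely OOD."""
    return min(float((x - m) @ cov_inv @ (x - m)) for m in means.values())
```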
1 code implementation • NeurIPS 2021 • Stanislav Fort, Jie Ren, Balaji Lakshminarayanan
Near out-of-distribution (OOD) detection is a major challenge for deep neural networks.
Ranked #2 on Out-of-Distribution Detection on CIFAR-10 vs CIFAR-100 (using extra training data)
no code implementations • 27 May 2021 • Stanislav Fort, Andrew Brock, Razvan Pascanu, Soham De, Samuel L. Smith
In this work, we provide a detailed empirical evaluation of how the number of augmentation samples per unique image influences model performance on held out data when training deep ResNets.
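Concretely, "augmentation samples per unique image" means each image contributes several independently augmented views to the batch; a sketch, with `augment` an assumed stochastic transform and `m` the multiplicity being varied:

```python
import torch

def repeated_augmentation_batch(images, labels, augment, m=4):
    views = torch.cat([augment(images) for _ in range(m)], dim=0)
    targets = labels.repeat(m)
    return views, targets  # m independently augmented views per unique image
```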
Ranked #124 on Image Classification on ImageNet
1 code implementation • 22 Apr 2021 • James Lucas, Juhan Bae, Michael R. Zhang, Stanislav Fort, Richard Zemel, Roger Grosse
Linear interpolation between initial neural network parameters and converged parameters after training with stochastic gradient descent (SGD) typically leads to a monotonic decrease in the training objective.
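The probe itself is simple to reproduce; a sketch that evaluates the training loss along the straight line between the two parameter vectors (state dicts), with `loss_fn` an assumed closure over the training data:

```python
import torch

def interpolation_losses(model, theta_init, theta_final, loss_fn, alphas):
    losses = []
    for alpha in alphas:
        theta = {k: (1 - alpha) * theta_init[k] + alpha * theta_final[k]
                 for k in theta_init}
        model.load_state_dict(theta)
        with torch.no_grad():
            losses.append(loss_fn(model).item())
    return losses  # often, but not always, monotonically decreasing
```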
no code implementations • 1 Jan 2021 • Stanislav Fort, Ekin Dogus Cubuk, Surya Ganguli, Samuel Stern Schoenholz
Deep neural network classifiers naturally partition input space into regions belonging to different classes.
no code implementations • NeurIPS 2020 • Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli
We study the relationship between the training dynamics of nonlinear deep networks, the geometry of the loss landscape, and the time evolution of a data-dependent neural tangent kernel (NTK).
2 code implementations • ICLR 2021 • Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew M. Dai, Dustin Tran
Recent approaches to efficient neural-network ensembling have shown that strong robustness and uncertainty performance can be achieved with a negligible increase in parameters over the original network.
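One such approach is a multi-input multi-output (MIMO) configuration: M images are stacked channel-wise, passed through one backbone, and read out as M independent predictions. A hedged sketch; the widened first layer, M=3, and the single linear head are illustrative assumptions:

```python
import torch.nn as nn

class MIMOWrapper(nn.Module):
    """The backbone's first layer is assumed widened to accept 3*M channels
    and to return a [B, feat_dim] feature vector."""
    def __init__(self, backbone, feat_dim, n_classes, M=3):
        super().__init__()
        self.M, self.n_classes = M, n_classes
        self.backbone = backbone
        self.heads = nn.Linear(feat_dim, n_classes * M)

    def forward(self, xs):                  # xs: [B, M, 3, H, W]
        feats = self.backbone(xs.flatten(1, 2))
        return self.heads(feats).view(-1, self.M, self.n_classes)
```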
no code implementations • ICLR 2020 • Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, Krzysztof Geras
We argue for the existence of the "break-even" point on this trajectory, beyond which the curvature of the loss surface and noise in the gradient are implicitly regularized by SGD.
1 code implementation • 5 Dec 2019 • Stanislav Fort, Huiyi Hu, Balaji Lakshminarayanan
One possible explanation for this gap between theory and practice is that popular scalable variational Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space.
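The deep-ensemble baseline referred to here is simply a set of independently initialized and trained networks averaged in probability space; a minimal sketch:

```python
import torch

def ensemble_predict(models, x):
    """models: networks trained from different random initializations,
    each of which tends to settle in a different mode in function space."""
    probs = [torch.softmax(m(x), dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)
```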
no code implementations • 14 Oct 2019 • Stanislav Fort, Surya Ganguli
The local geometry of high-dimensional neural network loss landscapes can both challenge our cherished theoretical intuitions and dramatically impact the practical success of neural network training.
1 code implementation • NeurIPS 2019 • Stanislav Fort, Stanislaw Jastrzebski
There are many surprising and perhaps counter-intuitive properties of optimization of deep neural networks.
no code implementations • 28 Jan 2019 • Stanislav Fort, Paweł Krzysztof Nowak, Stanislaw Jastrzebski, Srini Narayanan
In particular, we study how stiffness depends on 1) class membership, 2) distance between data points in the input space, 3) training iteration, and 4) learning rate.
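Stiffness here is the cosine similarity between loss gradients at two data points: positive stiffness means a gradient step that helps one point also helps the other. A sketch of the measurement:

```python
import torch
import torch.nn.functional as F

def stiffness(model, loss_fn, x1, y1, x2, y2):
    grads = []
    for x, y in [(x1, y1), (x2, y2)]:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    return F.cosine_similarity(grads[0], grads[1], dim=0).item()
```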
no code implementations • 17 Dec 2018 • Yihui Quek, Stanislav Fort, Hui Khoon Ng
We demonstrate that our algorithm learns to work with basis, symmetric informationally complete (SIC), and other types of POVMs.
no code implementations • 6 Jul 2018 • Stanislav Fort, Adam Scherlis
We observe this effect for fully-connected neural networks over a range of network widths and depths on the MNIST and CIFAR-10 datasets with the $\mathrm{ReLU}$ and $\tanh$ non-linearities, and a similar effect for convolutional networks.
no code implementations • 2 Dec 2017 • Stanislav Fort
Supermassive black holes at the centers of clusters of galaxies strongly interact with their host environment via AGN (active galactic nucleus) feedback.
1 code implementation • ICLR 2018 • Stanislav Fort
We show that Gaussian prototypical networks are a preferred architecture over vanilla prototypical networks with an equivalent number of parameters.
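The distinguishing ingredient is that each support embedding carries a predicted confidence, and the class prototype is the confidence-weighted mean rather than a plain average; a sketch, with inverse variances standing in for the network's confidence outputs:

```python
import torch

def gaussian_prototype(embeddings, inv_variances):
    """embeddings, inv_variances: [n_support, d]; returns the
    confidence-weighted class center used in place of a plain mean."""
    weighted = (embeddings * inv_variances).sum(dim=0)
    return weighted / inv_variances.sum(dim=0)
```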