Search Results for author: Jascha Sohl-Dickstein

Found 82 papers, 45 papers with code

General-Purpose In-Context Learning by Meta-Learning Transformers

no code implementations8 Dec 2022 Louis Kirsch, James Harrison, Jascha Sohl-Dickstein, Luke Metz

We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size (memory) determining the next prediction, unlike standard models which are thought to be bottlenecked by parameter count.

Inductive Bias Meta-Learning

VeLO: Training Versatile Learned Optimizers by Scaling Up

1 code implementation17 Nov 2022 Luke Metz, James Harrison, C. Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, Jascha Sohl-Dickstein

While deep learning models have replaced hand-designed features across many domains, these models are still trained with hand-designed optimizers.

A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases

1 code implementation22 Sep 2022 James Harrison, Luke Metz, Jascha Sohl-Dickstein

We apply the resulting learned optimizer to a variety of neural network training tasks, where it outperforms the current state of the art learned optimizer -- at matched optimizer computational overhead -- with regard to optimization performance and meta-training speed, and is capable of generalization to tasks far different from those it was meta-trained on.

Inductive Bias

Fast Finite Width Neural Tangent Kernel

2 code implementations17 Jun 2022 Roman Novak, Jascha Sohl-Dickstein, Samuel S. Schoenholz

We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks.

Meta-Learning

Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling

1 code implementation15 Jun 2022 Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow.

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

1 code implementation9 Jun 2022 Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Jones, Joshua B. Tenenbaum, Joshua S. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramón Risco Delgado, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Timothy Telleen-Lawton, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, ZiRui Wang, Ziyi Wu

BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models.

Common Sense Reasoning Memorization

Practical tradeoffs between memory, compute, and performance in learned optimizers

1 code implementation22 Mar 2022 Luke Metz, C. Daniel Freeman, James Harrison, Niru Maheswaranathan, Jascha Sohl-Dickstein

We further leverage our analysis to construct a learned optimizer that is both faster and more memory efficient than previous work.

Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies

2 code implementations27 Dec 2021 Paul Vicol, Luke Metz, Jascha Sohl-Dickstein

Unrolled computation graphs arise in many scenarios, including training RNNs, tuning hyperparameters through unrolled optimization, and training learned optimizers.

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

2 code implementations6 Dec 2021 Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshuang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondrej Dusek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo, Samuel Cahyawijaya, Emile Chapuis, Wanxiang Che, Mukund Choudhary, Christian Clauss, Pierre Colombo, Filip Cornell, Gautier Dagan, Mayukh Das, Tanay Dixit, Thomas Dopierre, Paul-Alexis Dray, Suchitra Dubey, Tatiana Ekeinhor, Marco Di Giovanni, Tanya Goyal, Rishabh Gupta, Louanes Hamla, Sang Han, Fabrice Harel-Canada, Antoine Honore, Ishan Jindal, Przemyslaw K. Joniak, Denis Kleyko, Venelin Kovatchev, Kalpesh Krishna, Ashutosh Kumar, Stefan Langer, Seungjae Ryan Lee, Corey James Levinson, Hualou Liang, Kaizhao Liang, Zhexiong Liu, Andrey Lukyanenko, Vukosi Marivate, Gerard de Melo, Simon Meoni, Maxime Meyer, Afnan Mir, Nafise Sadat Moosavi, Niklas Muennighoff, Timothy Sum Hon Mun, Kenton Murray, Marcin Namysl, Maria Obedkova, Priti Oli, Nivranshu Pasricha, Jan Pfister, Richard Plant, Vinay Prabhu, Vasile Pais, Libo Qin, Shahab Raji, Pawan Kumar Rajpoot, Vikas Raunak, Roy Rinberg, Nicolas Roberts, Juan Diego Rodriguez, Claude Roux, Vasconcellos P. H. S., Ananya B. Sai, Robin M. Schmidt, Thomas Scialom, Tshephisho Sefara, Saqib N. Shamsi, Xudong Shen, Haoyue Shi, Yiwen Shi, Anna Shvets, Nick Siegel, Damien Sileo, Jamie Simon, Chandan Singh, Roman Sitelew, Priyank Soni, Taylor Sorensen, William Soto, Aman Srivastava, KV Aditya Srivatsa, Tony Sun, Mukund Varma T, A Tabassum, Fiona Anting Tan, Ryan Teehan, Mo Tiwari, Marie Tolkiehn, Athena Wang, Zijian Wang, Gloria Wang, Zijie J. Wang, Fuxuan Wei, Bryan Wilie, Genta Indra Winata, Xinyi Wu, Witold Wydmański, Tianbao Xie, Usama Yaseen, Michael A. Yee, Jing Zhang, Yue Zhang

Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on.

Data Augmentation

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

1 code implementation5 Oct 2021 James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein, Samuel S. Schoenholz

Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the "shape" of the network's initialization-time kernel function.

Training Learned Optimizers with Randomly Initialized Learned Optimizers

no code implementations14 Jan 2021 Luke Metz, C. Daniel Freeman, Niru Maheswaranathan, Jascha Sohl-Dickstein

We show that a population of randomly initialized learned optimizers can be used to train themselves from scratch in an online fashion, without resorting to a hand designed optimizer in any part of the process.

Overcoming barriers to the training of effective learned optimizers

no code implementations1 Jan 2021 Luke Metz, Niru Maheswaranathan, C. Daniel Freeman, Ben Poole, Jascha Sohl-Dickstein

In this work we focus on general-purpose learned optimizers capable of training a wide variety of problems with no user-specified hyperparameters.

The large learning rate phase of deep learning

1 code implementation1 Jan 2021 Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari

In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks.

Parallel Training of Deep Networks with Local Updates

no code implementations7 Dec 2020 Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel

Deep learning models trained on large data sets have been widely successful in both vision and language domains.

Score-Based Generative Modeling through Stochastic Differential Equations

8 code implementations ICLR 2021 Yang song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole

Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9. 89 and FID of 2. 20, a competitive likelihood of 2. 99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.

Colorization Image Inpainting +1

Towards NNGP-guided Neural Architecture Search

1 code implementation11 Nov 2020 Daniel S. Park, Jaehoon Lee, Daiyi Peng, Yuan Cao, Jascha Sohl-Dickstein

Since NNGP inference provides a cheap measure of performance of a network architecture, we investigate its potential as a signal for neural architecture search (NAS).

Neural Architecture Search

Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves

no code implementations23 Sep 2020 Luke Metz, Niru Maheswaranathan, C. Daniel Freeman, Ben Poole, Jascha Sohl-Dickstein

In this work we focus on general-purpose learned optimizers capable of training a wide variety of problems with no user-specified hyperparameters.

Finite Versus Infinite Neural Networks: an Empirical Study

1 code implementation NeurIPS 2020 Jaehoon Lee, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, Jascha Sohl-Dickstein

We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods.

A new method for parameter estimation in probabilistic models: Minimum probability flow

1 code implementation17 Jul 2020 Jascha Sohl-Dickstein, Peter Battaglino, Michael R. DeWeese

Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function.

Infinite attention: NNGP and NTK for deep attention networks

1 code implementation ICML 2020 Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, Roman Novak

There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures.

Deep Attention Gaussian Processes

Exact posterior distributions of wide Bayesian neural networks

2 code implementations18 Jun 2020 Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large.

Your GAN is Secretly an Energy-based Model and You Should use Discriminator Driven Latent Sampling

3 code implementations NeurIPS 2020 Tong Che, Ruixiang Zhang, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, Yoshua Bengio

To make that practical, we show that sampling from this modified density can be achieved by sampling in latent space according to an energy-based model induced by the sum of the latent prior log-density and the discriminator output score.

Image Generation

The large learning rate phase of deep learning: the catapult mechanism

no code implementations4 Mar 2020 Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari

In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks.

On the infinite width limit of neural networks with a standard parameterization

1 code implementation21 Jan 2020 Jascha Sohl-Dickstein, Roman Novak, Samuel S. Schoenholz, Jaehoon Lee

However, the extrapolation of both of these parameterizations to infinite width is problematic.

Invertible Convolutional Flow

1 code implementation NeurIPS 2019 Mahdi Karami, Dale Schuurmans, Jascha Sohl-Dickstein, Laurent Dinh, Daniel Duckworth

We show that these transforms allow more effective normalizing flow models to be developed for generative image models.

Neural reparameterization improves structural optimization

1 code implementation NeurIPS Workshop Deep_Invers 2019 Stephan Hoyer, Jascha Sohl-Dickstein, Sam Greydanus

Structural optimization is a popular method for designing objects such as bridge trusses, airplane wings, and optical devices.

Using learned optimizers to make models robust to input noise

no code implementations8 Jun 2019 Luke Metz, Niru Maheswaranathan, Jonathon Shlens, Jascha Sohl-Dickstein, Ekin D. Cubuk

State-of-the art vision models can achieve superhuman performance on image classification tasks when testing and training data come from the same distribution.

General Classification Image Classification +1

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

no code implementations9 May 2019 Daniel S. Park, Jascha Sohl-Dickstein, Quoc V. Le, Samuel L. Smith

We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions.

Guided Evolutionary Strategies: Escaping the curse of dimensionality in random search

no code implementations ICLR 2019 Niru Maheswaranathan, Luke Metz, George Tucker, Dami Choi, Jascha Sohl-Dickstein

This arises when an approximate gradient is easier to compute than the full gradient (e. g. in meta-learning or unrolled optimization), or when a true gradient is intractable and is replaced with a surrogate (e. g. in certain reinforcement learning applications or training networks with discrete variables).

Meta-Learning

Learning Unsupervised Learning Rules

no code implementations ICLR 2019 Luke Metz, Niru Maheswaranathan, Brian Cheung, Jascha Sohl-Dickstein

Here, our desired task (meta-objective) is the performance of the representation on semi-supervised classification, and we meta-learn an algorithm -- an unsupervised weight update rule -- that produces representations that perform well under this meta-objective.

Meta-Learning

Learned optimizers that outperform on wall-clock and validation loss

no code implementations ICLR 2019 Luke Metz, Niru Maheswaranathan, Jeremy Nixon, Daniel Freeman, Jascha Sohl-Dickstein

We demonstrate these results on problems where our learned optimizer trains convolutional networks in a fifth of the wall-clock time compared to tuned first-order methods, and with an improvement

A Mean Field Theory of Batch Normalization

no code implementations ICLR 2019 Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz

We develop a mean field theory for batch normalization in fully-connected feedforward neural networks.

Eliminating all bad Local Minima from Loss Landscapes without even adding an Extra Unit

no code implementations12 Jan 2019 Jascha Sohl-Dickstein, Kenji Kawaguchi

Recent work has noted that all bad local minima can be removed from neural network loss landscapes, by adding a single unit with a particular parameterization.

Measuring the Effects of Data Parallelism on Neural Network Training

no code implementations8 Nov 2018 Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl

Along the way, we show that disagreements in the literature on how batch size affects model quality can largely be explained by differences in metaparameter tuning and compute budgets at different batch sizes.

Understanding and correcting pathologies in the training of learned optimizers

1 code implementation24 Oct 2018 Luke Metz, Niru Maheswaranathan, Jeremy Nixon, C. Daniel Freeman, Jascha Sohl-Dickstein

Deep learning has shown that learned functions can dramatically outperform hand-designed functions on perceptual tasks.

Adversarial Reprogramming of Neural Networks

6 code implementations ICLR 2019 Gamaleldin F. Elsayed, Ian Goodfellow, Jascha Sohl-Dickstein

Previous adversarial attacks have been designed to degrade performance of models or cause machine learning models to produce specific outputs chosen ahead of time by the attacker.

BIG-bench Machine Learning General Classification

Guided evolutionary strategies: Augmenting random search with surrogate gradients

1 code implementation ICLR 2019 Niru Maheswaranathan, Luke Metz, George Tucker, Dami Choi, Jascha Sohl-Dickstein

We propose Guided Evolutionary Strategies, a method for optimally using surrogate gradient directions along with random search.

Meta-Learning

Stochastic natural gradient descent draws posterior samples in function space

no code implementations25 Jun 2018 Samuel L. Smith, Daniel Duckworth, Semon Rezchikov, Quoc V. Le, Jascha Sohl-Dickstein

Recent work has argued that stochastic gradient descent can approximate the Bayesian uncertainty in model parameters near local minima.

PCA of high dimensional random walks with comparison to neural network training

no code implementations NeurIPS 2018 Joseph M. Antognini, Jascha Sohl-Dickstein

One technique to visualize the training of neural networks is to perform PCA on the parameters over the course of training and to project to the subspace spanned by the first few PCA components.

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

3 code implementations ICML 2018 Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington

In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme.

Meta-Learning Update Rules for Unsupervised Representation Learning

2 code implementations ICLR 2019 Luke Metz, Niru Maheswaranathan, Brian Cheung, Jascha Sohl-Dickstein

Specifically, we target semi-supervised classification performance, and we meta-learn an algorithm -- an unsupervised weight update rule -- that produces representations useful for this task.

Meta-Learning Representation Learning

Sensitivity and Generalization in Neural Networks: an Empirical Study

no code implementations ICLR 2018 Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models.

Data Augmentation Image Classification

Adversarial Examples that Fool both Computer Vision and Time-Limited Humans

no code implementations NeurIPS 2018 Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, Jascha Sohl-Dickstein

Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich.

BIG-bench Machine Learning

Generalizing Hamiltonian Monte Carlo with Neural Networks

2 code implementations ICLR 2018 Daniel Levy, Matthew D. Hoffman, Jascha Sohl-Dickstein

We present a general-purpose method to train Markov chain Monte Carlo kernels, parameterized by deep neural networks, that converge and mix quickly to their target distribution.

Deep Neural Networks as Gaussian Processes

7 code implementations ICLR 2018 Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

As such, previous work has not identified that these kernels can be used as covariance functions for GPs and allow fully Bayesian prediction with a deep neural network.

Bayesian Inference Gaussian Processes

A Correspondence Between Random Neural Networks and Statistical Field Theory

no code implementations18 Oct 2017 Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics.

SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability

3 code implementations NeurIPS 2017 Maithra Raghu, Justin Gilmer, Jason Yosinski, Jascha Sohl-Dickstein

We propose a new technique, Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods).

Learned Optimizers that Scale and Generalize

1 code implementation ICML 2017 Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, Jascha Sohl-Dickstein

Two of the primary barriers to its adoption are an inability to scale to larger problems and a limited ability to generalize to new tasks.

Improved generator objectives for GANs

no code implementations8 Dec 2016 Ben Poole, Alexander A. Alemi, Jascha Sohl-Dickstein, Anelia Angelova

We present a framework to understand GAN training as alternating density ratio estimation and approximate divergence minimization.

Density Ratio Estimation

Capacity and Trainability in Recurrent Neural Networks

1 code implementation29 Nov 2016 Jasmine Collins, Jascha Sohl-Dickstein, David Sussillo

They can store an amount of task information which is linear in the number of parameters, and is approximately 5 bits per parameter.

Survey of Expressivity in Deep Neural Networks

no code implementations24 Nov 2016 Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, Jascha Sohl-Dickstein

This quantity grows exponentially in the depth of the network, and is responsible for the depth sensitivity observed.

Unrolled Generative Adversarial Networks

9 code implementations7 Nov 2016 Luke Metz, Ben Poole, David Pfau, Jascha Sohl-Dickstein

We introduce a method to stabilize Generative Adversarial Networks (GANs) by defining the generator objective with respect to an unrolled optimization of the discriminator.

Deep Information Propagation

1 code implementation4 Nov 2016 Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, Jascha Sohl-Dickstein

We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks.

Exponential expressivity in deep neural networks through transient chaos

1 code implementation NeurIPS 2016 Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, Surya Ganguli

We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in generic, deep neural networks with random weights.

On the Expressive Power of Deep Neural Networks

no code implementations ICML 2017 Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, Jascha Sohl-Dickstein

We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute.

Density estimation using Real NVP

29 code implementations27 May 2016 Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio

Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning.

Ranked #21 on Image Generation on ImageNet 32x32 (bpd metric)

BIG-bench Machine Learning Density Estimation +1

A universal tradeoff between power, precision and speed in physical communication

no code implementations24 Mar 2016 Subhaneil Lahiri, Jascha Sohl-Dickstein, Surya Ganguli

Maximizing the speed and precision of communication while minimizing power dissipation is a fundamental engineering design goal.

Friction

A Markov Jump Process for More Efficient Hamiltonian Monte Carlo

no code implementations13 Sep 2015 Andrew B. Berger, Mayur Mudigonda, Michael R. DeWeese, Jascha Sohl-Dickstein

In most sampling algorithms, including Hamiltonian Monte Carlo, transition rates between states correspond to the probability of making a transition in a single time step, and are constrained to be less than or equal to 1.

Deep Knowledge Tracing

6 code implementations NeurIPS 2015 Chris Piech, Jonathan Spencer, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas Guibas, Jascha Sohl-Dickstein

Knowledge tracing---where a machine models the knowledge of a student as they interact with coursework---is a well established problem in computer supported education.

Knowledge Tracing

Note on Equivalence Between Recurrent Neural Network Time Series Models and Variational Bayesian Models

no code implementations29 Apr 2015 Jascha Sohl-Dickstein, Diederik P. Kingma

We observe that the standard log likelihood training objective for a Recurrent Neural Network (RNN) model of time series data is equivalent to a variational Bayesian training objective, given the proper choice of generative and inference models.

Time Series

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

5 code implementations12 Mar 2015 Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli

A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable.

Hamiltonian Monte Carlo Without Detailed Balance

2 code implementations18 Sep 2014 Jascha Sohl-Dickstein, Mayur Mudigonda, Michael R. DeWeese

We present a method for performing Hamiltonian Monte Carlo that largely eliminates sample rejection for typical hyperparameters.

Analyzing noise in autoencoders and deep networks

no code implementations6 Jun 2014 Ben Poole, Jascha Sohl-Dickstein, Surya Ganguli

Autoencoders have emerged as a useful framework for unsupervised learning of internal representations, and a wide variety of apparently conceptually disparate regularization techniques have been proposed to generate useful features.

Denoising

Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods

1 code implementation9 Nov 2013 Jascha Sohl-Dickstein, Ben Poole, Surya Ganguli

This algorithm contrasts with earlier stochastic second order techniques that treat the Hessian of each contributing function as a noisy approximation to the full Hessian, rather than as a target for direct estimation.

Training sparse natural image models with a fast Gibbs sampler of an extended state space

no code implementations NeurIPS 2012 Lucas Theis, Jascha Sohl-Dickstein, Matthias Bethge

We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models.

Efficient Methods for Unsupervised Learning of Probabilistic Models

no code implementations19 May 2012 Jascha Sohl-Dickstein

In this thesis I develop a variety of techniques to train, evaluate, and sample from intractable and high dimensional probabilistic models.

An Unsupervised Algorithm For Learning Lie Group Transformations

no code implementations7 Jan 2010 Jascha Sohl-Dickstein, Ching Ming Wang, Bruno A. Olshausen

Transformation operators are represented in their eigen-basis, reducing the computational complexity of parameter estimation to that of training a linear transformation model.

Translation

Minimum Probability Flow Learning

1 code implementation25 Jun 2009 Jascha Sohl-Dickstein, Peter Battaglino, Michael R. DeWeese

Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function and its derivatives.

Cannot find the paper you are looking for? You can Submit a new open access paper.