Search Results for author: Jonathan Frankle

Found 33 papers, 14 papers with code

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

24 code implementations · ICLR 2019 · Jonathan Frankle, Michael Carbin

Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations.

Network Pruning
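
For illustration, here is a minimal PyTorch sketch of the iterative magnitude pruning (IMP) procedure behind winning tickets: train, prune the smallest-magnitude weights, reset the survivors to their original initialization, and repeat. The `train_fn` callback, the 20% per-round pruning fraction, and the masking details are assumptions for the sketch, not the paper's exact code.

```python
import copy
import torch

def find_winning_ticket(model, train_fn, rounds=5, prune_frac=0.2):
    """Iterative magnitude pruning with reset to the original initialization.

    `train_fn(model)` is an assumed caller-supplied routine that trains the
    model to completion and returns it; hyperparameters are illustrative.
    """
    init_state = copy.deepcopy(model.state_dict())               # theta_0
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()
             if p.dim() > 1}                                     # weight tensors only

    for _ in range(rounds):
        model = train_fn(model)                                  # train to completion
        for name, param in model.named_parameters():
            if name not in masks:
                continue
            # Prune the smallest-magnitude weights that are still alive.
            alive = param.detach().abs()[masks[name].bool()]
            k = int(prune_frac * alive.numel())
            if k > 0:
                threshold = alive.kthvalue(k).values
                masks[name][param.detach().abs() <= threshold] = 0.0
        # Reset the surviving weights to their original initialization.
        model.load_state_dict(init_state)
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])
        # A real implementation would also re-apply the masks after every
        # optimizer step inside train_fn so pruned weights stay at zero.
    return model, masks
```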

Stabilizing the Lottery Ticket Hypothesis

3 code implementations · 5 Mar 2019 · Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin

With this change, it finds small subnetworks of deeper networks (e.g., 80% sparsity on ResNet-50) that can complete the training process to match the accuracy of the original network on more challenging tasks (e.g., ImageNet).

Linear Mode Connectivity and the Lottery Ticket Hypothesis

2 code implementations · ICML 2020 · Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin

We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation).

Linear Mode Connectivity
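
The core measurement is simple to sketch: interpolate linearly between the weights of two trained copies of a network and track test error along the path. Below is a minimal PyTorch-style sketch; `evaluate` is an assumed helper that returns test error, and batch-norm buffers would need more care in a real experiment.

```python
import copy
import torch

def interpolation_barrier(model_a, model_b, evaluate, steps=11):
    """Test error along the linear path between two trained networks.

    `evaluate(model)` is an assumed helper returning test error. Two runs are
    linearly mode connected when the barrier along this path is roughly zero.
    """
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    errors = []
    for i in range(steps):
        alpha = i / (steps - 1)
        mixed = {k: (1 - alpha) * state_a[k].float() + alpha * state_b[k].float()
                 for k in state_a}                   # note: averages buffers too
        probe.load_state_dict(mixed)
        errors.append(evaluate(probe))
    # Barrier: peak error on the path above the mean error of the endpoints.
    barrier = max(errors) - 0.5 * (errors[0] + errors[-1])
    return errors, barrier
```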

The Early Phase of Neural Network Training

1 code implementation · ICLR 2020 · Jonathan Frankle, David J. Schwab, Ari S. Morcos

We perform extensive measurements of the network state during these early iterations of training and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset.

Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs

4 code implementations · ICLR 2021 · Jonathan Frankle, David J. Schwab, Ari S. Morcos

A wide variety of deep learning techniques from style transfer to multitask learning rely on training affine transformations of features.

Style Transfer
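
Reproducing the basic setup is mostly a matter of freezing everything except the BatchNorm affine parameters. A minimal PyTorch sketch, using a torchvision ResNet as a stand-in (the architecture and optimizer settings are illustrative):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # any CNN with BatchNorm layers works

model = resnet18()  # randomly initialized; the conv weights will never train

# Freeze every parameter, then unfreeze only the BatchNorm affine parameters
# (the per-channel scale gamma and shift beta).
for param in model.parameters():
    param.requires_grad = False
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        module.weight.requires_grad = True
        module.bias.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9, weight_decay=1e-4)
# ... standard training loop over image batches goes here ...
```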

CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images

1 code implementation · 25 Oct 2023 · Aaron Gokaslan, A. Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, Volodymyr Kuleshov

This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train text-to-image generative models; (2) CC images are relatively scarce.

Transfer Learning

What is the State of Neural Network Pruning?

1 code implementation · 6 Mar 2020 · Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, John Guttag

Neural network pruning, the task of reducing the size of a network by removing parameters, has been the subject of a great deal of work in recent years.

Network Pruning

Comparing Rewinding and Fine-tuning in Neural Network Pruning

2 code implementations · ICLR 2020 · Alex Renda, Jonathan Frankle, Michael Carbin

Learning rate rewinding (which we propose) trains the unpruned weights from their final values using the same learning rate schedule as weight rewinding.

Network Pruning
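
The distinction between the three retraining strategies comes down to which weights you keep and which learning-rate schedule you replay. The sketch below contrasts them schematically; the step schedule, 90-epoch budget, and rewind point are illustrative numbers, not the paper's exact settings.

```python
def original_schedule(epoch, peak=0.1, total=90):
    """A typical step schedule: drop the learning rate 10x at 1/3 and 2/3 of training."""
    if epoch < total // 3:
        return peak
    if epoch < 2 * total // 3:
        return peak / 10
    return peak / 100

def retraining_lr(strategy, retrain_epoch, rewind_epoch=60, total=90):
    """Learning rate used at `retrain_epoch` of post-pruning retraining."""
    if strategy == "fine-tune":
        # Keep the final weights; retrain at the last (small) learning rate.
        return original_schedule(total - 1, total=total)
    if strategy == "weight-rewind":
        # Weights are rewound to epoch `rewind_epoch`; replay the schedule from there.
        return original_schedule(rewind_epoch + retrain_epoch, total=total)
    if strategy == "lr-rewind":
        # Keep the final weights, but replay the same rewound schedule anyway.
        return original_schedule(rewind_epoch + retrain_epoch, total=total)
    raise ValueError(f"unknown strategy: {strategy}")
```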

The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models

1 code implementation · CVPR 2021 · Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, Zhangyang Wang

We extend the scope of LTH and ask whether matching subnetworks that enjoy the same downstream transfer performance still exist in pre-trained computer vision models.

Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks

1 code implementation · 2 Jun 2022 · Mansheej Paul, Brett W. Larsen, Surya Ganguli, Jonathan Frankle, Gintare Karolina Dziugaite

A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network that can be trained to the same accuracy as the dense network.

Dissecting Pruned Neural Networks

no code implementations · 29 Jun 2019 · Jonathan Frankle, David Bau

Namely, we consider the effect of removing unnecessary structure on the number of hidden units that learn disentangled representations of human-recognizable concepts as identified by network dissection.

On the Predictability of Pruning Across Scales

no code implementations · 18 Jun 2020 · Jonathan S. Rosenfeld, Jonathan Frankle, Michael Carbin, Nir Shavit

We show that the error of iteratively magnitude-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task.

Are all negatives created equal in contrastive instance discrimination?

no code implementations · 13 Oct 2020 · Tiffany Tianhui Cai, Jonathan Frankle, David J. Schwab, Ari S. Morcos

Using methodology from MoCo v2 (Chen et al., 2020), we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations.

Image Classification · Self-Supervised Learning
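
One way to operationalize "difficulty" in this kind of setup is to rank each negative by its embedding similarity to the query, as in the sketch below. This is an illustrative reading of the MoCo-style procedure, not the paper's exact code; the queue of negative embeddings is assumed to exist.

```python
import torch
import torch.nn.functional as F

def rank_negatives_by_difficulty(query, negatives):
    """Rank a bank of negatives by difficulty for a single query.

    `query` is an (embed_dim,) embedding and `negatives` an (N, embed_dim)
    bank (e.g. a MoCo-style queue). Higher cosine similarity to the query
    means a harder negative. Illustrative sketch only.
    """
    query = F.normalize(query, dim=0)
    negatives = F.normalize(negatives, dim=1)
    similarity = negatives @ query                         # (N,) cosine similarities
    order = torch.argsort(similarity, descending=True)     # hardest first
    return order, similarity[order]

# Example: restrict attention to a difficulty band, e.g. the hardest 5%.
# order, sims = rank_negatives_by_difficulty(q, queue)
# hard_band = order[: int(0.05 * len(order))]
```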

Revisiting "Qualitatively Characterizing Neural Network Optimization Problems"

no code implementations · NeurIPS Workshop DL-IG 2020 · Jonathan Frankle

We revisit and extend the experiments of Goodfellow et al. (2014), who showed that - for then state-of-the-art networks - "the objective function has a simple, approximately convex shape" along the linear path between initialization and the trained weights.

Studying the Consistency and Composability of Lottery Ticket Pruning Masks

no code implementations · 30 Apr 2021 · Rajiv Movva, Jonathan Frankle, Michael Carbin

Magnitude pruning is a common, effective technique to identify sparse subnetworks at little cost to accuracy.

What can linear interpolation of neural network loss landscapes tell us?

no code implementations · 30 Jun 2021 · Tiffany Vlaar, Jonathan Frankle

In this paper, we put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices.

Trade-offs of Local SGD at Scale: An Empirical Study

no code implementations · 15 Oct 2021 · Jose Javier Gonzalez Ortiz, Jonathan Frankle, Mike Rabbat, Ari Morcos, Nicolas Ballas

As datasets and models become increasingly large, distributed training has become a necessary component to allow deep neural networks to train in reasonable amounts of time.

Image Classification
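
Local SGD itself is easy to sketch: each worker takes several SGD steps on its own data shard, then all workers average their parameters and continue from the average. The single-process simulation below is illustrative; the number of local steps, learning rate, and the `loss_fn`/`data_shards` inputs are assumptions.

```python
import torch

def local_sgd_round(workers, data_shards, loss_fn, local_steps=8, lr=0.1):
    """One communication round of local SGD, simulated in a single process.

    `workers` is a list of identically shaped models (one per simulated worker)
    and `data_shards[i]` yields (x, y) batches for worker i. Each worker takes
    `local_steps` independent SGD steps, then parameters are averaged.
    """
    for model, shard in zip(workers, data_shards):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for step, (x, y) in enumerate(shard):
            if step >= local_steps:
                break
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    # Average parameters across workers and broadcast the average back.
    with torch.no_grad():
        avg = {k: torch.stack([w.state_dict()[k].float() for w in workers]).mean(0)
               for k in workers[0].state_dict()}
        for w in workers:
            w.load_state_dict(avg)
    return workers
```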

Mode Connectivity and Sparse Neural Networks

no code implementations · 25 Sep 2019 · Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin

We observe that these subnetworks match the accuracy of the full network only when two SGD runs for the same subnetwork are connected by linear paths with no change in test error.

Strengthening Subcommunities: Towards Sustainable Growth in AI Research

no code implementations · 18 Apr 2022 · Andi Peng, Jessica Zosa Forde, Yonadav Shavit, Jonathan Frankle

AI's rapid growth has been felt acutely by scholarly venues, leading to growing pains within the peer review process.

Fast Benchmarking of Accuracy vs. Training Time with Cyclic Learning Rates

1 code implementation · 2 Jun 2022 · Jacob Portes, Davis Blalock, Cory Stephenson, Jonathan Frankle

Benchmarking the tradeoff between neural network accuracy and training time is computationally expensive.

Benchmarking
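
One plausible setup in this spirit is a single run with a cyclic (warm-restart) learning-rate schedule: the learning rate decays to near zero at the end of every cycle, and evaluating the checkpoint at each of those points gives one point on the accuracy-vs-training-time curve. The cycle length, learning-rate values, and stand-in model below are illustrative, not the paper's settings.

```python
import torch

model = torch.nn.Linear(10, 2)                       # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9)
cycle_len = 500
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=cycle_len, T_mult=1, eta_min=1e-4)

tradeoff_points = []
for step in range(4 * cycle_len):                    # four cycles = four budgets
    # ... forward pass, loss.backward(), optimizer.step() on a real batch ...
    optimizer.step()                                 # placeholder step for the sketch
    scheduler.step()
    if (step + 1) % cycle_len == 0:                  # LR has decayed to ~eta_min here
        tradeoff_points.append((step + 1, None))     # (steps so far, eval accuracy)
```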

Non-Determinism and the Lawlessness of Machine Learning Code

no code implementations · 23 Jun 2022 · A. Feder Cooper, Jonathan Frankle, Christopher De Sa

In this paper, we clarify the overlap and differences between these two concepts, and show that the effects of non-determinism, and consequently its implications for the law, become clearer from the perspective of reasoning about ML outputs as distributions over possible outcomes.

Legal Reasoning

Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?

no code implementations · 6 Oct 2022 · Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli, Gintare Karolina Dziugaite

Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP.

Pruning's Effect on Generalization Through the Lens of Training and Regularization

no code implementations · 25 Oct 2022 · Tian Jin, Michael Carbin, Daniel M. Roy, Jonathan Frankle, Gintare Karolina Dziugaite

Pruning models in this over-parameterized regime leads to a contradiction: while theory predicts that reducing model size harms generalization, pruning to a range of sparsities nonetheless improves it.

Reduce, Reuse, Recycle: Improving Training Efficiency with Distillation

no code implementations · 1 Nov 2022 · Cody Blakeney, Jessica Zosa Forde, Jonathan Frankle, Ziliang Zong, Matthew L. Leavitt

We conducted a series of experiments to investigate whether and how distillation can be used to accelerate training using ResNet-50 trained on ImageNet and BERT trained on C4 with a masked language modeling objective and evaluated on GLUE, using common enterprise hardware (8x NVIDIA A100).

Image Classification · Language Modelling · +1

The Effect of Data Dimensionality on Neural Network Prunability

no code implementations · 1 Dec 2022 · Zachary Ankner, Alex Renda, Gintare Karolina Dziugaite, Jonathan Frankle, Tian Jin

Practitioners prune neural networks for efficiency gains and generalization improvements, but few scrutinize the factors determining the prunability of a neural network: the maximum fraction of weights that pruning can remove without compromising the model's test accuracy.

Knowledge Distillation for Efficient Sequences of Training Runs

no code implementations · 11 Mar 2023 · Xingyu Liu, Alex Leonardi, Lu Yu, Chris Gilmer-Hill, Matthew Leavitt, Jonathan Frankle

We find that augmenting future runs with KD from previous runs dramatically reduces the time necessary to train these models, even taking into account the overhead of KD.

Knowledge Distillation
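
A standard distillation loss (soft teacher targets plus hard labels) is the building block both of these distillation papers rely on; in the sequential-runs setting, the teacher logits would come from a checkpoint of a previous run. The temperature and weighting below are illustrative values, not the papers' exact settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    """Knowledge-distillation loss: KL to softened teacher targets + cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                     # rescale so gradients match the hard loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# In a sequence of runs, `teacher_logits` would be produced (under no_grad) by a
# frozen model loaded from an earlier run's checkpoint.
```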

Dynamic Masking Rate Schedules for MLM Pretraining

no code implementations · 24 May 2023 · Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, Matthew L. Leavitt

Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%.

Language Modelling · Masked Language Modeling · +1
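
A dynamic masking-rate schedule is straightforward to sketch: compute the rate from the current training step and use it when corrupting each batch. The linear 30% to 15% decay below and the simplified masking (no 80/10/10 replacement rule) are illustrative assumptions, not the paper's exact schedule.

```python
import torch

def masking_rate(step, total_steps, start_rate=0.30, end_rate=0.15):
    """Linearly decay the MLM masking rate from `start_rate` to `end_rate`."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start_rate + frac * (end_rate - start_rate)

def mask_tokens(input_ids, mask_token_id, rate, special_mask=None):
    """Mask a fraction `rate` of (non-special) tokens; return inputs and labels."""
    probs = torch.full(input_ids.shape, rate)
    if special_mask is not None:              # never mask [CLS]/[SEP]/padding
        probs.masked_fill_(special_mask, 0.0)
    selected = torch.bernoulli(probs).bool()
    labels = input_ids.clone()
    labels[~selected] = -100                  # ignore unmasked positions in the loss
    masked_inputs = input_ids.clone()
    masked_inputs[selected] = mask_token_id
    return masked_inputs, labels
```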

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

no code implementations · 31 Dec 2023 · Nikhil Sardana, Jonathan Frankle

We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand.

Language Modelling · Large Language Model
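
The bookkeeping behind this accounting can be sketched with the common approximations of roughly 6ND FLOPs to train an N-parameter model on D tokens and roughly 2N FLOPs per generated token at inference. The paper folds the inference term into the scaling-law optimization itself; the numbers below are purely illustrative.

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Total lifetime compute under the ~6*N*D training and ~2*N per-token
    inference approximations. Illustrative bookkeeping only."""
    train = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * inference_tokens
    return train + inference

# Example (illustrative numbers, not a quality-matched comparison): once
# inference demand is large, a smaller model trained on more tokens can be
# cheaper over its lifetime than a larger model trained on fewer tokens.
big = lifetime_flops(70e9, 1.4e12, inference_tokens=2e12)
small = lifetime_flops(30e9, 3.5e12, inference_tokens=2e12)
print(f"70B/1.4T tokens: {big:.2e} FLOPs   30B/3.5T tokens: {small:.2e} FLOPs")
```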

Dataset Difficulty and the Role of Inductive Bias

no code implementations · 3 Jan 2024 · Devin Kwok, Nikhil Anand, Jonathan Frankle, Gintare Karolina Dziugaite, David Rolnick

Motivated by the goals of dataset pruning and defect identification, a growing body of methods has been developed to score individual examples within a dataset.

Inductive Bias
