24 code implementations • ICLR 2019 • Jonathan Frankle, Michael Carbin
Based on these results, we articulate the "lottery ticket hypothesis": dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations.
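The ticket-finding recipe the hypothesis implies can be sketched in a few lines: prune the smallest-magnitude trained weights, then rewind the survivors to their initial values. This is a minimal illustration with stand-in weight vectors, not the paper's actual training pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
w_init = rng.normal(size=100)                          # weights at initialization
w_trained = w_init + rng.normal(scale=0.5, size=100)   # stand-in for SGD training

sparsity = 0.8  # remove 80% of weights
k = int(sparsity * w_trained.size)
threshold = np.sort(np.abs(w_trained))[k]
mask = (np.abs(w_trained) >= threshold).astype(float)

# the "winning ticket": surviving weights rewound to their initial values
winning_ticket = mask * w_init
print(int((winning_ticket != 0).sum()))  # 20 weights survive
```

The ticket is then trained in isolation (with the mask held fixed) and compared against the dense network.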
3 code implementations • 5 Mar 2019 • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin
With this change, it finds small subnetworks of deeper networks (e.g., 80% sparsity on ResNet-50) that can complete the training process to match the accuracy of the original network on more challenging tasks (e.g., ImageNet).
2 code implementations • ICML 2020 • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin
We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation).
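The linear-connectivity test amounts to interpolating between two trained solutions and measuring the error barrier along the path. A minimal sketch, with a toy quadratic loss standing in for test error on real data:

```python
import numpy as np

def loss(w):
    # toy convex loss standing in for test error
    return float(np.sum((w - 1.0) ** 2))

# two hypothetical solutions from SGD runs with different noise samples
w_a = np.array([0.9, 1.1, 1.0])
w_b = np.array([1.1, 0.9, 1.0])

alphas = np.linspace(0.0, 1.0, 11)
path_losses = [loss((1 - a) * w_a + a * w_b) for a in alphas]
# barrier: how much worse the path gets than its endpoints
barrier = max(path_losses) - max(loss(w_a), loss(w_b))
print(round(barrier, 6))
```

A barrier near zero means the two runs found linearly connected minima; a large barrier means SGD noise pushed them into different basins.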
1 code implementation • ICLR 2020 • Jonathan Frankle, David J. Schwab, Ari S. Morcos
We perform extensive measurements of the network state during these early iterations of training and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset.
4 code implementations • ICLR 2021 • Jonathan Frankle, David J. Schwab, Ari S. Morcos
A wide variety of deep learning techniques from style transfer to multitask learning rely on training affine transformations of features.
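The affine transformation in question is the per-channel scale-and-shift familiar from BatchNorm's gamma/beta parameters or FiLM conditioning. A minimal sketch (parameter names illustrative):

```python
import numpy as np

def affine(features, gamma, beta):
    # per-channel scale and shift: features is (batch, channels),
    # gamma and beta are learned (channels,) vectors
    return features * gamma + beta

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
gamma = np.array([2.0, 0.5])
beta = np.array([0.0, 1.0])
print(affine(x, gamma, beta))  # [[2. 2.] [6. 3.]]
```

Despite their tiny parameter count, training only these gamma/beta pairs (with all other weights frozen) is the setting the paper investigates.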
1 code implementation • 25 Oct 2023 • Aaron Gokaslan, A. Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, Volodymyr Kuleshov
This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train text-to-image generative models; (2) CC images are relatively scarce.
1 code implementation • 27 Mar 2024 • Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, Christopher D. Manning
Models such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance on a wide variety of biomedical NLP tasks.
1 code implementation • NeurIPS 2023 • Jacob Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, Daya Khudia, Jonathan Frankle
Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining.
1 code implementation • 6 Mar 2020 • Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, John Guttag
Neural network pruning (the task of reducing the size of a network by removing parameters) has been the subject of a great deal of work in recent years.
2 code implementations • NeurIPS 2020 • Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin
For a range of downstream tasks, we indeed find matching subnetworks at 40% to 90% sparsity.
2 code implementations • ICLR 2020 • Alex Renda, Jonathan Frankle, Michael Carbin
Learning rate rewinding (which we propose) trains the unpruned weights from their final values while replaying the same learning rate schedule used by weight rewinding.
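The contrast between the two retraining schemes can be sketched as follows; the step numbers and stepwise schedule are illustrative stand-ins for a real training run:

```python
def lr_schedule(step, total=90):
    # stand-in stepwise schedule: 0.1, then 0.01 for the final third
    return 0.1 if step < 2 * total // 3 else 0.01

def retrain_plan(scheme, rewind_step=30, total=90):
    # both schemes replay the learning rate schedule from rewind_step
    replayed_lrs = [lr_schedule(t) for t in range(rewind_step, total)]
    if scheme == "weight_rewinding":
        weights = "values from the rewind step"   # rewind weights too
    elif scheme == "lr_rewinding":
        weights = "final trained values"          # keep weights, replay LR only
    return {"start_weights": weights, "lrs": replayed_lrs}

a = retrain_plan("weight_rewinding")
b = retrain_plan("lr_rewinding")
print(a["lrs"] == b["lrs"], a["start_weights"] != b["start_weights"])  # True True
```

The two schemes share the schedule and differ only in the starting weights, which is exactly the comparison the paper makes.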
1 code implementation • CVPR 2021 • Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, Zhangyang Wang
We extend the scope of LTH and ask whether pre-trained computer vision models contain matching subnetworks that enjoy the same downstream transfer performance.
1 code implementation • 2 Jun 2022 • Mansheej Paul, Brett W. Larsen, Surya Ganguli, Jonathan Frankle, Gintare Karolina Dziugaite
A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network that can be trained to the same accuracy as the dense network.
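The IMP-with-rewinding loop referenced here can be sketched as: train, prune the smallest 20% of surviving weights, rewind survivors to their values from an early step k, and repeat. The "training" below is a stand-in perturbation, not real SGD:

```python
import numpy as np

rng = np.random.default_rng(1)
w_step_k = rng.normal(size=64)   # weights after a few hundred dense steps
mask = np.ones_like(w_step_k)

for _ in range(3):
    # stand-in for training the masked network to completion
    w_trained = mask * (w_step_k + rng.normal(scale=0.1, size=64))
    alive = np.abs(w_trained[mask == 1])
    threshold = np.quantile(alive, 0.2)          # prune lowest 20% of survivors
    mask = mask * (np.abs(w_trained) > threshold)
    w = mask * w_step_k                          # rewind survivors to step k
print(int(mask.sum()))
```

Each round compounds the sparsity (roughly 0.8^rounds of the weights remain), and the rewind to step k rather than to initialization is the detail the paper highlights.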
no code implementations • 29 Jun 2019 • Jonathan Frankle, David Bau
Namely, we consider the effect of removing unnecessary structure on the number of hidden units that learn disentangled representations of human-recognizable concepts as identified by network dissection.
no code implementations • 18 Jun 2020 • Jonathan S. Rosenfeld, Jonathan Frankle, Michael Carbin, Nir Shavit
We show that the error of iteratively magnitude-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task.
no code implementations • ICLR 2021 • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin
Recent work has explored the possibility of pruning neural networks at initialization.
no code implementations • 13 Oct 2020 • Tiffany Tianhui Cai, Jonathan Frankle, David J. Schwab, Ari S. Morcos
Using methodology from MoCo v2 (Chen et al., 2020), we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations.
no code implementations • NeurIPS Workshop DL-IG 2020 • Jonathan Frankle
We revisit and extend the experiments of Goodfellow et al. (2014), who showed that, for then-state-of-the-art networks, "the objective function has a simple, approximately convex shape" along the linear path between initialization and the trained weights.
no code implementations • 30 Apr 2021 • Rajiv Movva, Jonathan Frankle, Michael Carbin
Magnitude pruning is a common, effective technique to identify sparse subnetworks at little cost to accuracy.
no code implementations • 30 Jun 2021 • Tiffany Vlaar, Jonathan Frankle
In this paper, we put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices.
no code implementations • 15 Oct 2021 • Jose Javier Gonzalez Ortiz, Jonathan Frankle, Mike Rabbat, Ari Morcos, Nicolas Ballas
As datasets and models become increasingly large, distributed training has become a necessary component to allow deep neural networks to train in reasonable amounts of time.
no code implementations • 25 Sep 2019 • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin
We observe that these subnetworks match the accuracy of the full network only when two SGD runs for the same subnetwork are connected by linear paths with no change in test error.
no code implementations • 18 Apr 2022 • Andi Peng, Jessica Zosa Forde, Yonadav Shavit, Jonathan Frankle
AI's rapid growth has been felt acutely by scholarly venues, leading to growing pains within the peer review process.
1 code implementation • 2 Jun 2022 • Jacob Portes, Davis Blalock, Cory Stephenson, Jonathan Frankle
Benchmarking the tradeoff between neural network accuracy and training time is computationally expensive.
no code implementations • 23 Jun 2022 • A. Feder Cooper, Jonathan Frankle, Christopher De Sa
In this paper, we clarify the overlap and differences between these two concepts, and show that the effects of non-determinism, and consequently its implications for the law, become clearer from the perspective of reasoning about ML outputs as distributions over possible outcomes.
no code implementations • 6 Oct 2022 • Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli, Gintare Karolina Dziugaite
Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP.
no code implementations • 25 Oct 2022 • Tian Jin, Michael Carbin, Daniel M. Roy, Jonathan Frankle, Gintare Karolina Dziugaite
Pruning models in this over-parameterized regime leads to a contradiction: while theory predicts that reducing model size harms generalization, pruning to a range of sparsities nonetheless improves it.
no code implementations • 1 Nov 2022 • Cody Blakeney, Jessica Zosa Forde, Jonathan Frankle, Ziliang Zong, Matthew L. Leavitt
We conducted a series of experiments to investigate whether and how distillation can be used to accelerate training, using ResNet-50 trained on ImageNet and BERT trained on C4 with a masked language modeling objective (evaluated on GLUE), on common enterprise hardware (8x NVIDIA A100).
no code implementations • 1 Dec 2022 • Zachary Ankner, Alex Renda, Gintare Karolina Dziugaite, Jonathan Frankle, Tian Jin
Practitioners prune neural networks for efficiency gains and generalization improvements, but few scrutinize the factors determining the prunability of a neural network: the maximum fraction of weights that pruning can remove without compromising the model's test accuracy.
no code implementations • 11 Mar 2023 • Xingyu Liu, Alex Leonardi, Lu Yu, Chris Gilmer-Hill, Matthew Leavitt, Jonathan Frankle
We find that augmenting future runs with KD from previous runs dramatically reduces the time necessary to train these models, even taking into account the overhead of KD.
no code implementations • 24 May 2023 • Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, Matthew L. Leavitt
Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%.
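The masking rate is simply the fraction of token positions selected for prediction during MLM pretraining. A minimal sketch of varying that knob (function and parameter names are illustrative, not the paper's code):

```python
import numpy as np

def sample_mask_positions(seq_len, masking_rate, rng):
    # choose which token positions the model must predict
    n_mask = max(1, int(round(seq_len * masking_rate)))
    return rng.choice(seq_len, size=n_mask, replace=False)

rng = np.random.default_rng(0)
positions_bert = sample_mask_positions(128, 0.15, rng)   # BERT's default rate
positions_high = sample_mask_positions(128, 0.30, rng)   # a higher rate
print(len(positions_bert), len(positions_high))  # 19 38
```

Raising the rate gives more prediction targets per sequence at the cost of less intact context, which is the trade-off the paper examines.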
no code implementations • 31 Dec 2023 • Nikhil Sardana, Jonathan Frankle
We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand.
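The accounting behind this modification can be sketched with the standard approximations of ~6ND FLOPs for training and ~2N FLOPs per inference token; the parameter counts, token counts, and demand figures below are made-up illustrations, not the paper's numbers.

```python
def lifetime_flops(n_params, n_train_tokens, n_inference_tokens):
    # training cost ~ 6 * params * tokens; inference ~ 2 * params per token
    train = 6 * n_params * n_train_tokens
    inference = 2 * n_params * n_inference_tokens
    return train + inference

# with enough inference demand, a smaller model trained on more tokens
# can be cheaper over its lifetime than a Chinchilla-optimal larger one
big = lifetime_flops(70e9, 1.4e12, 1e12)
small = lifetime_flops(30e9, 3.0e12, 1e12)
print(big > small)  # True
```

Folding the inference term into the objective is what shifts the optimal parameter count below the training-only Chinchilla prescription.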
no code implementations • 3 Jan 2024 • Devin Kwok, Nikhil Anand, Jonathan Frankle, Gintare Karolina Dziugaite, David Rolnick
Motivated by the goals of dataset pruning and defect identification, a growing body of methods has been developed to score individual examples within a dataset.