24 code implementations • ICLR 2019 • Jonathan Frankle, Michael Carbin
Based on these results, we articulate the "lottery ticket hypothesis": dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations.
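The ticket-finding recipe the hypothesis implies can be sketched in a few lines: prune the smallest-magnitude trained weights, then rewind the survivors to their initial values. This is a minimal illustration with stand-in weight vectors, not the paper's actual training pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
w_init = rng.normal(size=100)                          # weights at initialization
w_trained = w_init + rng.normal(scale=0.5, size=100)   # stand-in for SGD training

sparsity = 0.8  # remove 80% of weights
k = int(sparsity * w_trained.size)
threshold = np.sort(np.abs(w_trained))[k]
mask = (np.abs(w_trained) >= threshold).astype(float)

# the "winning ticket": surviving weights rewound to their initial values
winning_ticket = mask * w_init
print(int((winning_ticket != 0).sum()))  # 20 weights survive
```

The ticket is then trained in isolation (with the mask held fixed) and compared against the dense network.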
3 code implementations • 5 Mar 2019 • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin
With this change, it finds small subnetworks of deeper networks (e.g., 80% sparsity on ResNet-50) that can complete the training process to match the accuracy of the original network on more challenging tasks (e.g., ImageNet).
2 code implementations • ICML 2020 • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin
We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation).
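The linear-connectivity test amounts to interpolating between two trained solutions and measuring the error barrier along the path. A minimal sketch, with a toy quadratic loss standing in for test error on real data:

```python
import numpy as np

def loss(w):
    # toy convex loss standing in for test error
    return float(np.sum((w - 1.0) ** 2))

# two hypothetical solutions from SGD runs with different noise samples
w_a = np.array([0.9, 1.1, 1.0])
w_b = np.array([1.1, 0.9, 1.0])

alphas = np.linspace(0.0, 1.0, 11)
path_losses = [loss((1 - a) * w_a + a * w_b) for a in alphas]
# barrier: how much worse the path gets than its endpoints
barrier = max(path_losses) - max(loss(w_a), loss(w_b))
print(round(barrier, 6))
```

A barrier near zero means the two runs found linearly connected minima; a large barrier means SGD noise pushed them into different basins.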
1 code implementation • ICLR 2020 • Jonathan Frankle, David J. Schwab, Ari S. Morcos
We perform extensive measurements of the network state during these early iterations of training and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset.
4 code implementations • ICLR 2021 • Jonathan Frankle, David J. Schwab, Ari S. Morcos
A wide variety of deep learning techniques from style transfer to multitask learning rely on training affine transformations of features.
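The affine transformation in question is the per-channel scale-and-shift familiar from BatchNorm's gamma/beta parameters or FiLM conditioning. A minimal sketch (parameter names illustrative):

```python
import numpy as np

def affine(features, gamma, beta):
    # per-channel scale and shift: features is (batch, channels),
    # gamma and beta are learned (channels,) vectors
    return features * gamma + beta

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
gamma = np.array([2.0, 0.5])
beta = np.array([0.0, 1.0])
print(affine(x, gamma, beta))  # [[2. 2.] [6. 3.]]
```

Despite their tiny parameter count, training only these gamma/beta pairs (with all other weights frozen) is the setting the paper investigates.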
1 code implementation • 25 Oct 2023 • Aaron Gokaslan, A. Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, Volodymyr Kuleshov
This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train text-to-image generative models; (2) CC images are relatively scarce.
1 code implementation • 27 Mar 2024 • Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, Christopher D. Manning
Models such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance on a wide variety of biomedical NLP tasks.
1 code implementation • NeurIPS 2023 • Jacob Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, Daya Khudia, Jonathan Frankle
Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining.
1 code implementation • 6 Mar 2020 • Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, John Guttag
Neural network pruning (the task of reducing the size of a network by removing parameters) has been the subject of a great deal of work in recent years.
2 code implementations • NeurIPS 2020 • Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin
For a range of downstream tasks, we indeed find matching subnetworks at 40% to 90% sparsity.
2 code implementations • ICLR 2020 • Alex Renda, Jonathan Frankle, Michael Carbin
Learning rate rewinding (which we propose) trains the unpruned weights from their final values while replaying the same learning rate schedule used by weight rewinding.
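The contrast between the two retraining schemes can be sketched as follows; the step numbers and stepwise schedule are illustrative stand-ins for a real training run:

```python
def lr_schedule(step, total=90):
    # stand-in stepwise schedule: 0.1, then 0.01 for the final third
    return 0.1 if step < 2 * total // 3 else 0.01

def retrain_plan(scheme, rewind_step=30, total=90):
    # both schemes replay the learning rate schedule from rewind_step
    replayed_lrs = [lr_schedule(t) for t in range(rewind_step, total)]
    if scheme == "weight_rewinding":
        weights = "values from the rewind step"   # rewind weights too
    elif scheme == "lr_rewinding":
        weights = "final trained values"          # keep weights, replay LR only
    return {"start_weights": weights, "lrs": replayed_lrs}

a = retrain_plan("weight_rewinding")
b = retrain_plan("lr_rewinding")
print(a["lrs"] == b["lrs"], a["start_weights"] != b["start_weights"])  # True True
```

The two schemes share the schedule and differ only in the starting weights, which is exactly the comparison the paper makes.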
1 code implementation • CVPR 2021 • Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, Zhangyang Wang
We extend the scope of LTH and ask whether pre-trained computer vision models contain matching subnetworks that enjoy the same downstream transfer performance.
1 code implementation • 2 Jun 2022 • Mansheej Paul, Brett W. Larsen, Surya Ganguli, Jonathan Frankle, Gintare Karolina Dziugaite
A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network that can be trained to the same accuracy as the dense network.
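The IMP-with-rewinding loop referenced here can be sketched as: train, prune the smallest 20% of surviving weights, rewind survivors to their values from an early step k, and repeat. The "training" below is a stand-in perturbation, not real SGD:

```python
import numpy as np

rng = np.random.default_rng(1)
w_step_k = rng.normal(size=64)   # weights after a few hundred dense steps
mask = np.ones_like(w_step_k)

for _ in range(3):
    # stand-in for training the masked network to completion
    w_trained = mask * (w_step_k + rng.normal(scale=0.1, size=64))
    alive = np.abs(w_trained[mask == 1])
    threshold = np.quantile(alive, 0.2)          # prune lowest 20% of survivors
    mask = mask * (np.abs(w_trained) > threshold)
    w = mask * w_step_k                          # rewind survivors to step k
print(int(mask.sum()))
```

Each round compounds the sparsity (roughly 0.8^rounds of the weights remain), and the rewind to step k rather than to initialization is the detail the paper highlights.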
no code implementations • 29 Jun 2019 • Jonathan Frankle, David Bau
Namely, we consider the effect of removing unnecessary structure on the number of hidden units that learn disentangled representations of human-recognizable concepts as identified by network dissection.
no code implementations • 18 Jun 2020 • Jonathan S. Rosenfeld, Jonathan Frankle, Michael Carbin, Nir Shavit
We show that the error of iteratively magnitude-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task.
no code implementations • ICLR 2021 • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin
Recent work has explored the possibility of pruning neural networks at initialization.
no code implementations • 13 Oct 2020 • Tiffany Tianhui Cai, Jonathan Frankle, David J. Schwab, Ari S. Morcos
Using methodology from MoCo v2 (Chen et al., 2020), we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations.
no code implementations • NeurIPS Workshop DL-IG 2020 • Jonathan Frankle
We revisit and extend the experiments of Goodfellow et al. (2014), who showed that, for then-state-of-the-art networks, "the objective function has a simple, approximately convex shape" along the linear path between initialization and the trained weights.
no code implementations • 30 Apr 2021 • Rajiv Movva, Jonathan Frankle, Michael Carbin
Magnitude pruning is a common, effective technique to identify sparse subnetworks at little cost to accuracy.
no code implementations • 30 Jun 2021 • Tiffany Vlaar, Jonathan Frankle
In this paper, we put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices.
no code implementations • 15 Oct 2021 • Jose Javier Gonzalez Ortiz, Jonathan Frankle, Mike Rabbat, Ari Morcos, Nicolas Ballas
As datasets and models become increasingly large, distributed training has become a necessary component to allow deep neural networks to train in reasonable amounts of time.
no code implementations • 25 Sep 2019 • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin
We observe that these subnetworks match the accuracy of the full network only when two SGD runs for the same subnetwork are connected by linear paths with no change in test error.
no code implementations • 18 Apr 2022 • Andi Peng, Jessica Zosa Forde, Yonadav Shavit, Jonathan Frankle
AI's rapid growth has been felt acutely by scholarly venues, leading to growing pains within the peer review process.
1 code implementation • 2 Jun 2022 • Jacob Portes, Davis Blalock, Cory Stephenson, Jonathan Frankle
Benchmarking the tradeoff between neural network accuracy and training time is computationally expensive.
no code implementations • 23 Jun 2022 • A. Feder Cooper, Jonathan Frankle, Christopher De Sa
In this paper, we clarify the overlap and differences between these two concepts, and show that the effects of non-determinism, and consequently its implications for the law, become clearer from the perspective of reasoning about ML outputs as distributions over possible outcomes.
no code implementations • 6 Oct 2022 • Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli, Gintare Karolina Dziugaite
Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP.
no code implementations • 25 Oct 2022 • Tian Jin, Michael Carbin, Daniel M. Roy, Jonathan Frankle, Gintare Karolina Dziugaite
Pruning models in this over-parameterized regime leads to a contradiction: while theory predicts that reducing model size harms generalization, pruning to a range of sparsities nonetheless improves it.
no code implementations • 1 Nov 2022 • Cody Blakeney, Jessica Zosa Forde, Jonathan Frankle, Ziliang Zong, Matthew L. Leavitt
We conducted a series of experiments to investigate whether and how distillation can be used to accelerate training, using ResNet-50 trained on ImageNet and BERT trained on C4 with a masked language modeling objective (evaluated on GLUE), on common enterprise hardware (8x NVIDIA A100).
no code implementations • 1 Dec 2022 • Zachary Ankner, Alex Renda, Gintare Karolina Dziugaite, Jonathan Frankle, Tian Jin
Practitioners prune neural networks for efficiency gains and generalization improvements, but few scrutinize the factors determining the prunability of a neural network: the maximum fraction of weights that pruning can remove without compromising the model's test accuracy.
no code implementations • 11 Mar 2023 • Xingyu Liu, Alex Leonardi, Lu Yu, Chris Gilmer-Hill, Matthew Leavitt, Jonathan Frankle
We find that augmenting future runs with KD from previous runs dramatically reduces the time necessary to train these models, even taking into account the overhead of KD.
no code implementations • 24 May 2023 • Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, Matthew L. Leavitt
Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%.
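The masking rate is simply the fraction of token positions selected for prediction during MLM pretraining. A minimal sketch of varying that knob (function and parameter names are illustrative, not the paper's code):

```python
import numpy as np

def sample_mask_positions(seq_len, masking_rate, rng):
    # choose which token positions the model must predict
    n_mask = max(1, int(round(seq_len * masking_rate)))
    return rng.choice(seq_len, size=n_mask, replace=False)

rng = np.random.default_rng(0)
positions_bert = sample_mask_positions(128, 0.15, rng)   # BERT's default rate
positions_high = sample_mask_positions(128, 0.30, rng)   # a higher rate
print(len(positions_bert), len(positions_high))  # 19 38
```

Raising the rate gives more prediction targets per sequence at the cost of less intact context, which is the trade-off the paper examines.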
no code implementations • 31 Dec 2023 • Nikhil Sardana, Jonathan Frankle
We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand.
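The accounting behind this modification can be sketched with the standard approximations of ~6ND FLOPs for training and ~2N FLOPs per inference token; the parameter counts, token counts, and demand figures below are made-up illustrations, not the paper's numbers.

```python
def lifetime_flops(n_params, n_train_tokens, n_inference_tokens):
    # training cost ~ 6 * params * tokens; inference ~ 2 * params per token
    train = 6 * n_params * n_train_tokens
    inference = 2 * n_params * n_inference_tokens
    return train + inference

# with enough inference demand, a smaller model trained on more tokens
# can be cheaper over its lifetime than a Chinchilla-optimal larger one
big = lifetime_flops(70e9, 1.4e12, 1e12)
small = lifetime_flops(30e9, 3.0e12, 1e12)
print(big > small)  # True
```

Folding the inference term into the objective is what shifts the optimal parameter count below the training-only Chinchilla prescription.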
no code implementations • 3 Jan 2024 • Devin Kwok, Nikhil Anand, Jonathan Frankle, Gintare Karolina Dziugaite, David Rolnick
Motivated by the goals of dataset pruning and defect identification, a growing body of methods has been developed to score individual examples within a dataset.