Our observations are consistent across multiple network architectures, datasets, and tasks, and imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model; 2) the learned "important" weights of the large model are typically not useful for the small pruned model; and 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency of the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm.
Based on these results, we articulate the "lottery ticket hypothesis": dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations.
This paper presents a method for adding multiple tasks to a single deep neural network while avoiding catastrophic forgetting.
First, we move the ReLU operation into the Winograd domain to increase the sparsity of the transformed activations.
Structured pruning is a popular method for compressing a neural network: given a large trained network, one alternates between removing channel connections and fine-tuning, reducing the overall width of the network.
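The alternation described above can be illustrated with a minimal sketch. It makes several assumptions that are not in the source: channels are ranked by L1 norm (one common criterion among many), a layer is represented as a plain list of flattened channels, and `fine_tune` is a hypothetical placeholder standing in for actual retraining.

```python
def channel_l1_norms(layer):
    """`layer` is a list of output channels; each channel is a flat list of weights."""
    return [sum(abs(w) for w in channel) for channel in layer]


def prune_channels(layer, n_remove):
    """Drop the `n_remove` channels with the smallest L1 norm, keeping original order.

    Assumption: L1 norm is used as the importance score for illustration only.
    """
    norms = channel_l1_norms(layer)
    ranked = sorted(range(len(layer)), key=lambda i: norms[i])
    drop = set(ranked[:n_remove])
    return [ch for i, ch in enumerate(layer) if i not in drop]


def fine_tune(layer):
    """Hypothetical placeholder: a real pipeline would retrain the remaining weights."""
    return layer


layer = [
    [0.9, -0.7, 0.4],      # L1 norm about 2.0
    [0.01, 0.02, -0.01],   # L1 norm about 0.04
    [0.5, 0.5, -0.5],      # L1 norm 1.5
    [0.1, -0.1, 0.0],      # L1 norm 0.2
]

# Alternate between removing one channel and fine-tuning until two channels remain.
while len(layer) > 2:
    layer = fine_tune(prune_channels(layer, n_remove=1))

print(len(layer))
```

In a real network, removing an output channel from one layer also requires removing the matching input channel from the next layer, which this single-layer sketch leaves out.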
Reducing the test-time resource requirements of a neural network while preserving test accuracy is crucial for running inference on resource-constrained devices.
Network pruning aims to impose sparsity on a neural network by increasing the proportion of zero-valued weights, thereby reducing its size, improving energy efficiency, and increasing evaluation speed.
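As a rough illustration of raising the proportion of zero-valued weights, the sketch below zeroes the smallest-magnitude entries of a flat weight list until a target sparsity is reached. Magnitude-based selection is an assumption chosen for simplicity, not a criterion stated in the source, and the function names `magnitude_prune` and `zero_fraction` are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries set to zero.

    `sparsity` is the target fraction of zero-valued weights, in [0, 1].
    Assumption: weights are ranked by absolute value (magnitude pruning).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def zero_fraction(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.05, 0.3, -0.6, 0.01, 0.2]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)                 # [0.8, 0.0, 0.3, -0.6, 0.0, 0.0]
print(zero_fraction(pruned))  # 0.5
```

Zeroed weights only translate into smaller models or faster evaluation when the storage format or the hardware can exploit the sparsity, which is outside the scope of this sketch.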
Recent developments in deep learning applied to language modeling have led to success in text processing, summarization, and machine translation tasks.