Uncovering the impact of learning rate for global magnitude pruning
A common paradigm in model pruning is to train a model, prune it, and then either fine-tune or, in the lottery ticket framework, reinitialize and retrain. However, prior work has implicitly assumed that the training configuration yielding the best model performance also yields the best pruning mask. In this paper, we ask a simple question: what if a training configuration that yields worse performance actually yields a better mask? To test this, we decouple the learning rate used for mask discovery (LR_find) from the one used for mask evaluation (LR_eval). Using a ResNet-50 on Tiny ImageNet, we discovered the counterintuitive "decoupled LR phenomenon": smaller LR_find values produce models with lower performance, yet the masks they generate reach substantially higher eventual performance than masks found with the same learning rate in both stages. We show that this phenomenon holds across models, datasets, and configurations, and also for one-shot structured pruning. Finally, we demonstrate that smaller LR_find values yield masks with materially different layerwise pruning ratios, and that the decoupled LR phenomenon is causally mediated by these ratios. Our results demonstrate the practical utility of decoupling learning rates and clarify the mechanisms underlying this counterintuitive effect.
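To make the two key ingredients concrete, the sketch below shows global magnitude pruning and the layerwise pruning ratios it induces. This is a minimal illustration, not the paper's code: the function names, the toy two-layer weight dictionary, and the magnitude scales are assumptions chosen only to show that a single global threshold prunes different layers at different rates.

```python
import numpy as np

def global_magnitude_mask(weights, sparsity):
    """Build binary masks keeping the largest-magnitude weights *globally*,
    i.e. one threshold shared across all layers (unlike layerwise pruning).

    weights: dict mapping layer name -> weight ndarray
    sparsity: fraction of all weights to prune, in [0, 1)
    """
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    threshold = np.quantile(all_mags, sparsity)
    return {name: np.abs(w) > threshold for name, w in weights.items()}

def layerwise_pruning_ratios(masks):
    """Fraction of weights pruned in each layer under the given masks."""
    return {name: 1.0 - m.mean() for name, m in masks.items()}

# Toy example: two "layers" with different magnitude scales, standing in for
# layers of a network trained at some LR_find. Because the threshold is
# global, the small-magnitude layer is pruned far more aggressively.
rng = np.random.default_rng(0)
weights = {
    "conv1": rng.normal(0.0, 1.0, size=1000),  # large-magnitude layer
    "fc": rng.normal(0.0, 0.1, size=1000),     # small-magnitude layer
}
masks = global_magnitude_mask(weights, sparsity=0.9)
print(layerwise_pruning_ratios(masks))
```

In the paper's setting, the weights passed to `global_magnitude_mask` would come from a network trained with LR_find, and the resulting masks would then be evaluated by retraining with LR_eval; the layerwise ratios computed here are the quantity the abstract identifies as causally mediating the decoupled LR phenomenon.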