Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.
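The two soup recipes evaluated in the paper can be illustrated with a short sketch. Below, model weights are represented as plain dicts of floats for simplicity; the helper names (`uniform_soup`, `greedy_soup`, `evaluate`) are illustrative stand-ins, not the authors' actual API, and a real implementation would average framework-native parameter tensors instead.

```python
# Minimal sketch of the two soup recipes described in the abstract.
# "Weights" are plain dicts of floats here; in practice these would be
# the state dicts of models fine-tuned from the same pre-trained init.

def uniform_soup(state_dicts):
    """Uniform soup: element-wise average of all models' parameters."""
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}

def greedy_soup(state_dicts, evaluate):
    """Greedy soup: visit models in order of held-out validation score,
    keeping each one in the soup only if adding it does not hurt."""
    ranked = sorted(state_dicts, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best = evaluate(uniform_soup(soup))
    for sd in ranked[1:]:
        candidate = uniform_soup(soup + [sd])
        score = evaluate(candidate)
        if score >= best:  # keep the ingredient only if the soup improves
            soup.append(sd)
            best = score
    return uniform_soup(soup)
```

Because the soup is a single set of averaged weights, inference cost is identical to a single model, unlike a logit ensemble that must run every member.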


Results from the Paper


 Ranked #1 on Image Classification on ImageNet V2 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | ImageNet | Model soups (ViT-G/14) | Top-1 Accuracy | 90.94% | #4 |
| Image Classification | ImageNet | Model soups (ViT-G/14) | Number of params | 1843M | #966 |
| Image Classification | ImageNet | Model soups (BASIC-L) | Top-1 Accuracy | 90.98% | #3 |
| Image Classification | ImageNet | Model soups (BASIC-L) | Number of params | 2440M | #972 |
| Domain Generalization | ImageNet-A | Model soups (BASIC-L) | Top-1 Accuracy (%) | 94.17 | #1 |
| Domain Generalization | ImageNet-A | Model soups (ViT-G/14) | Top-1 Accuracy (%) | 92.67 | #2 |
| Unsupervised Domain Adaptation | ImageNet-R | Model soups (ViT-G/14) | Top-1 Error | 4.54 | #1 |
| Domain Generalization | ImageNet-R | Model soups (BASIC-L) | Top-1 Error Rate | 3.90 | #1 |
| Domain Generalization | ImageNet-R | Model soups (ViT-G/14) | Top-1 Error Rate | 4.54 | #2 |
| Image Classification | ImageNet ReaL | Model soups (ViT-G/14) | Accuracy | 91.20% | #2 |
| Image Classification | ImageNet ReaL | Model soups (ViT-G/14) | Params | 1843M | #55 |
| Image Classification | ImageNet ReaL | Model soups (BASIC-L) | Accuracy | 91.03% | #7 |
| Image Classification | ImageNet ReaL | Model soups (BASIC-L) | Params | 2440M | #56 |
| Image Classification | ImageNet ReaL | Baseline (ViT-G/14) | Accuracy | 91.78% | #1 |
| Domain Generalization | ImageNet-Sketch | Model soups (ViT-G/14) | Top-1 Accuracy | 74.24 | #2 |
| Domain Generalization | ImageNet-Sketch | Model soups (BASIC-L) | Top-1 Accuracy | 77.18 | #1 |
| Image Classification | ImageNet V2 | Model soups (ViT-G/14) | Top-1 Accuracy | 84.22 | #3 |
| Image Classification | ImageNet V2 | Model soups (BASIC-L) | Top-1 Accuracy | 84.63 | #1 |
| Out-of-Distribution Generalization | ImageNet-W | Uniform Soup (ViT-B/32) | IN-W Gap | -7.9 | #1 |
| Out-of-Distribution Generalization | ImageNet-W | Uniform Soup (ViT-B/32) | Carton Gap | +24 | #1 |
| Out-of-Distribution Generalization | ImageNet-W | Greedy Soup (ViT-B/32) | IN-W Gap | -6.5 | #1 |
| Out-of-Distribution Generalization | ImageNet-W | Greedy Soup (ViT-B/32) | Carton Gap | +16 | #1 |
| Image Classification | ObjectNet | Baseline (ViT-G/14) | Top-1 Accuracy | 79.03 | #5 |
| Image Classification | ObjectNet | Model soups (ViT-G/14) | Top-1 Accuracy | 78.52 | #6 |

Methods