Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.


Results from the Paper

 Ranked #1 on Image Classification on iNaturalist 2019 (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | AVA v2.2 | Hiera-H (K700 PT+FT) | mAP | 43.3 | #2 |
| Object Detection | COCO minival | Hiera-L | box AP | 55 | #47 |
| Instance Segmentation | COCO minival | Hiera-L | mask AP | 48.6 | #30 |
| Image Classification | ImageNet | Hiera-H | Top 1 Accuracy | 86.9% | #108 |
| Image Classification | iNaturalist | Hiera-H (448px) | Top 1 Accuracy | 83.8 | #1 |
| Image Classification | iNaturalist 2018 | Hiera-H (448px) | Top-1 Accuracy | 87.3% | #4 |
| Image Classification | iNaturalist 2019 | Hiera-H (448px) | Top-1 Accuracy | 88.5 | #1 |
| Action Classification | Kinetics-400 | Hiera-H (no extra data) | Acc@1 | 87.8 | #18 |
| Action Classification | Kinetics-600 | Hiera-H (no extra data) | Top-1 Accuracy | 88.8 | #15 |
| Action Classification | Kinetics-700 | Hiera-H (no extra data) | Top-1 Accuracy | 81.1 | #8 |
| Image Classification | Places365-Standard | Hiera-H (448px) | Top 1 Accuracy | 60.6 | #2 |
| Action Recognition | Something-Something V2 | Hiera-L (no extra data) | Top-1 Accuracy | 76.5 | #5 |