Omnivore: A Single Model for Many Visual Modalities

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs on par with or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
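
The idea of "exactly the same model parameters" for every modality can be illustrated with a toy sketch. It is not the Omnivore implementation: the abstract does not describe how inputs are tokenized or how outputs are produced, so the sketch assumes that each input has already been turned into a list of fixed-size token vectors, that a single shared trunk processes them, and that each dataset gets its own linear classification head. The names `SharedModel`, `EMBED_DIM`, and the class counts are hypothetical and chosen only for illustration.

```python
import random

EMBED_DIM = 8  # hypothetical token size, far smaller than any real model


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class SharedModel:
    """Toy model: one trunk shared by all modalities, one head per task.

    Assumption for illustration only; the real trunk is a transformer.
    """

    def __init__(self, num_classes_per_task, seed=0):
        rng = random.Random(seed)
        # The trunk weights are created once and reused for every modality.
        self.trunk = make_matrix(EMBED_DIM, EMBED_DIM, rng)
        # Each task only adds a small task-specific linear head.
        self.heads = {
            task: make_matrix(n, EMBED_DIM, rng)
            for task, n in num_classes_per_task.items()
        }

    def forward(self, tokens, task):
        # Mean-pool the tokens so inputs of any length map to one vector.
        pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
        # Shared trunk followed by a ReLU.
        features = [max(0.0, v) for v in matvec(self.trunk, pooled)]
        # Task-specific head produces one logit per class.
        return matvec(self.heads[task], features)


def random_tokens(count, rng):
    """Stand-in for tokenized input; real tokens would come from patches."""
    return [[rng.random() for _ in range(EMBED_DIM)] for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(1)
    # Illustrative class counts for an image, a video, and an RGB-D task.
    model = SharedModel({"image": 1000, "video": 400, "rgbd": 19})
    print(len(model.forward(random_tokens(4, rng), "image")))   # 1000
    print(len(model.forward(random_tokens(12, rng), "video")))  # 400
    print(len(model.forward(random_tokens(4, rng), "rgbd")))    # 19
```

In this sketch a video simply contributes more tokens than an image, and the same `trunk` matrix handles both; only the final head differs per task. Joint training, as described in the abstract, would update that shared trunk with batches drawn from all of the classification tasks.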

Results from the Paper


 Ranked #1 on Scene Recognition on SUN-RGBD (using extra training data)

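| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | EPIC-KITCHENS-100 | OMNIVORE (Swin-B, finetuned) | Action@1 | 49.9 | #1 |
| Action Recognition | EPIC-KITCHENS-100 | OMNIVORE (Swin-B, finetuned) | Verb@1 | 69.5 | #3 |
| Action Recognition | EPIC-KITCHENS-100 | OMNIVORE (Swin-B, finetuned) | Noun@1 | 61.7 | #2 |
| Image Classification | ImageNet | Omnivore (Swin-B) | Top 1 Accuracy | 85.3% | #115 |
| Image Classification | ImageNet | Omnivore (Swin-B) | Top 5 Accuracy | 97.5% | #23 |
| Image Classification | ImageNet | Omnivore (Swin-L) | Top 1 Accuracy | 86.0% | #83 |
| Image Classification | ImageNet | Omnivore (Swin-L) | Top 5 Accuracy | 97.7% | #17 |
| Image Classification | iNaturalist 2018 | OMNIVORE (Swin-L) | Top-1 Accuracy | 84.1% | #6 |
| Action Classification | Kinetics-400 | OMNIVORE (Swin-B) | Vid acc@1 | 84.0 | #15 |
| Action Classification | Kinetics-400 | OMNIVORE (Swin-B) | Vid acc@5 | 96.2 | #12 |
| Action Classification | Kinetics-400 | OMNIVORE (Swin-L) | Vid acc@1 | 84.1 | #14 |
| Action Classification | Kinetics-400 | OMNIVORE (Swin-L) | Vid acc@5 | 96.1 | #13 |
| Semantic Segmentation | NYU Depth v2 | OMNIVORE (Swin-B, finetuned) | Mean IoU | 55.1% | #5 |
| Semantic Segmentation | NYU Depth v2 | OMNIVORE (Swin-L, finetuned) | Mean IoU | 56.8% | #2 |
| Action Recognition | Something-Something V2 | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) | Top-1 Accuracy | 71.4 | #6 |
| Action Recognition | Something-Something V2 | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) | Top-5 Accuracy | 93.5 | #5 |
| Scene Recognition | SUN-RGBD | OMNIVORE (Swin-B) | Accuracy (%) | 67.2 | #1 |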
