InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, which take advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets spanning a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our method obtains 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. These results demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
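
The abstract names video-language contrastive learning as one pretraining objective but does not spell out the loss. As a rough illustration only, the sketch below shows a symmetric InfoNCE-style loss, a common formulation for this kind of objective; it is an assumption, not the authors' implementation. The function names, the temperature value of 0.07, and the premise that `video_emb[i]` and `text_emb[i]` form a matched, L2-normalized pair are all hypothetical.

```python
import math


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (illustrative assumption).

    video_emb[i] and text_emb[i] are assumed to be a matched pair of
    L2-normalized vectors; all other pairings act as negatives.
    """
    n = len(video_emb)
    # Dot-product similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in text_emb]
        for v in video_emb
    ]
    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix.
    cols = [list(c) for c in zip(*logits)]
    t2v = -sum(log_softmax(cols[i])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)


aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy case the loss is close to zero when each video embedding matches its own caption and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.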


Results from the Paper


 Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Video Retrieval | ActivityNet | InternVideo | text-to-video R@1 | 30.7 | # 10 |
| | | | video-to-text R@1 | 31.4 | # 8 |
| Video Retrieval | ActivityNet | InternVideo | text-to-video R@1 | 62.2 | # 7 |
| | | | video-to-text R@1 | 62.8 | # 4 |
| Temporal Action Localization | ActivityNet-1.3 | InternVideo | mAP | 39.00 | # 8 |
| Spatio-Temporal Action Localization | AVA-Kinetics | InternVideo | val mAP | 41.01 | # 3 |
| Action Recognition | AVA v2.2 | InternVideo | mAP | 41.01 | # 6 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo | text-to-video R@1 | 31.5 | # 15 |
| | | | text-to-video R@5 | 57.6 | # 15 |
| | | | text-to-video R@10 | 68.2 | # 15 |
| | | | video-to-text R@1 | 33.5 | # 7 |
| | | | video-to-text R@5 | 60.3 | # 7 |
| | | | video-to-text R@10 | 71.1 | # 7 |
| Video Retrieval | DiDeMo | InternVideo | text-to-video R@1 | 57.9 | # 9 |
| | | | video-to-text R@1 | 59.1 | # 4 |
| Zero-Shot Video Question Answer | EgoSchema (fullset) | InternVideo | Accuracy | 32.1 | # 7 |
| Temporal Action Localization | FineAction | InternVideo | mAP | 17.57 | # 4 |
| Temporal Action Localization | HACS | InternVideo | Average-mAP | 41.55 | # 5 |
| Action Classification | Kinetics-400 | InternVideo | Acc@1 | 91.1 | # 3 |
| Action Classification | Kinetics-600 | InternVideo-T | Top-1 Accuracy | 91.3 | # 5 |
| Action Classification | Kinetics-700 | InternVideo-T | Top-1 Accuracy | 84.0 | # 3 |
| Video Retrieval | LSMDC | InternVideo | text-to-video R@1 | 34.0 | # 8 |
| | | | video-to-text R@1 | 34.9 | # 4 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo | text-to-video R@1 | 17.6 | # 7 |
| | | | video-to-text R@1 | 13.2 | # 4 |
| | | | text-to-video R@5 | 32.4 | # 7 |
| | | | text-to-video R@10 | 40.2 | # 7 |
| | | | video-to-text R@5 | 27.8 | # 4 |
| | | | video-to-text R@10 | 34.9 | # 4 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo | text-to-video R@1 | 40.7 | # 10 |
| | | | video-to-text R@1 | 39.6 | # 4 |
| Video Retrieval | MSR-VTT | InternVideo | text-to-video R@1 | 55.2 | # 7 |
| | | | video-to-text R@1 | 57.9 | # 6 |
| Visual Question Answering (VQA) | MSRVTT-QA | InternVideo | Accuracy | 0.471 | # 6 |
| Zero-Shot Video Retrieval | MSVD | InternVideo | text-to-video R@1 | 43.4 | # 9 |
| | | | video-to-text R@1 | 67.6 | # 7 |
| Video Retrieval | MSVD | InternVideo | text-to-video R@1 | 58.4 | # 3 |
| | | | video-to-text R@1 | 76.3 | # 3 |
| Visual Question Answering (VQA) | MSVD-QA | InternVideo | Accuracy | 0.555 | # 12 |
| Zero-Shot Video Question Answer | NExT-QA | InternVideo | Accuracy | 49.1 | # 13 |
| Action Recognition | Something-Something V1 | InternVideo | Top 1 Accuracy | 70.0 | # 1 |
| Action Recognition | Something-Something V2 | InternVideo | Top-1 Accuracy | 77.2 | # 3 |
| Zero-Shot Video Question Answer | STAR Benchmark | InternVideo | Accuracy | 41.6 | # 4 |
| | | | Accuracy | 41.6 | # 3 |
| Video Question Answering | STAR Benchmark | InternVideo | Average Accuracy | 58.7 | # 4 |
| Visual Question Answering (VQA) | TGIF-QA | InternVideo | Accuracy | 0.722 | # 2 |
| Temporal Action Localization | THUMOS’14 | ActionFormer (InternVideo features) | Avg mAP (0.3:0.7) | 71.58 | # 4 |
| Zero-Shot Video Question Answer | TVQA | InternVideo | Accuracy | 35.9 | # 5 |
| Open Set Action Recognition | UCF101-MiTv2 | InternVideo | AUROC | 91.85 | # 1 |
| Open Set Action Recognition | UCF-HMDB | InternVideo | AUROC | 85.48 | # 1 |
| Video Retrieval | VATEX | InternVideo | text-to-video R@1 | 71.1 | # 5 |
| | | | video-to-text R@1 | 87.2 | # 2 |
| Zero-Shot Video Retrieval | VATEX | InternVideo | text-to-video R@1 | 49.5 | # 4 |
| | | | video-to-text R@1 | 69.5 | # 4 |
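
The retrieval rows above report R@K (recall at K): the percentage of queries whose correct match appears among the top K ranked candidates. The sketch below shows one simple way such a number could be computed from a query-by-candidate similarity matrix. It is illustrative only: the function name `recall_at_k` is hypothetical, it assumes exactly one correct candidate per query (candidate i for query i), and it breaks ties in favor of the correct match. Benchmarks with several captions per video need a more careful protocol.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose true match ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of candidates scoring strictly higher than the match.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy text-to-video similarity matrix: rows are captions, columns are videos.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
t2v_r1 = recall_at_k(sim, 1)
# Video-to-text retrieval uses the transposed matrix.
v2t_r1 = recall_at_k([list(col) for col in zip(*sim)], 1)
print(round(t2v_r1, 1), round(v2t_r1, 1))
```

Text-to-video and video-to-text scores can differ on the same model because ranking along rows and ranking along columns of the similarity matrix are separate problems, which is why the table lists both directions.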

Methods