InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits their use for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.

Results from the Paper


 Ranked #1 on Video Retrieval on DiDeMo (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Retrieval | ActivityNet | InternVideo | text-to-video R@1 | 62.2 | #1 |
| Video Retrieval | ActivityNet | InternVideo | video-to-text R@1 | 62.8 | #1 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo | text-to-video R@1 | 30.7 | #2 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo | video-to-text R@1 | 31.4 | #2 |
| Spatio-Temporal Action Localization | AVA-Kinetics | InternVideo | val mAP | 42.51 | #1 |
| Action Recognition | AVA v2.2 | InternVideo | mAP | 41.01 | #1 |
| Video Retrieval | DiDeMo | InternVideo | text-to-video R@1 | 57.9 | #1 |
| Video Retrieval | DiDeMo | InternVideo | video-to-text R@1 | 59.1 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo | text-to-video R@1 | 31.5 | #3 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo | video-to-text R@1 | 33.5 | #1 |
| Action Classification | Kinetics-400 | InternVideo-T | Acc@1 | 91.1 | #1 |
| Action Classification | Kinetics-600 | InternVideo-T | Top-1 Accuracy | 91.3 | #3 |
| Action Classification | Kinetics-700 | InternVideo-T | Top-1 Accuracy | 84.0 | #1 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo | text-to-video R@1 | 17.6 | #2 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo | video-to-text R@1 | 13.2 | #1 |
| Video Retrieval | LSMDC | InternVideo | text-to-video R@1 | 34.0 | #3 |
| Video Retrieval | LSMDC | InternVideo | video-to-text R@1 | 34.9 | #2 |
| Video Retrieval | MSR-VTT | InternVideo | text-to-video R@1 | 55.2 | #1 |
| Video Retrieval | MSR-VTT | InternVideo | video-to-text R@1 | 57.9 | #3 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo | text-to-video R@1 | 40.7 | #2 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo | video-to-text R@1 | 39.6 | #1 |
| Visual Question Answering | MSRVTT-QA | InternVideo | Accuracy | 0.471 | #3 |
| Video Retrieval | MSVD | InternVideo | text-to-video R@1 | 58.4 | #2 |
| Video Retrieval | MSVD | InternVideo | video-to-text R@1 | 76.3 | #1 |
| Zero-Shot Video Retrieval | MSVD | InternVideo | text-to-video R@1 | 43.4 | #1 |
| Zero-Shot Video Retrieval | MSVD | InternVideo | video-to-text R@1 | 67.6 | #1 |
| Visual Question Answering | MSVD-QA | InternVideo | Accuracy | 0.555 | #5 |
| Action Recognition | Something-Something V1 | InternVideo | Top-1 Accuracy | 70.0 | #1 |
| Action Recognition | Something-Something V2 | InternVideo | Top-1 Accuracy | 77.2 | #1 |
| Visual Question Answering | TGIF-QA | InternVideo | Accuracy | 0.722 | #2 |
| Open Set Action Recognition | UCF101-MiTv2 | InternVideo | AUROC | 91.85 | #1 |
| Open Set Action Recognition | UCF-HMDB | InternVideo | AUROC | 85.48 | #1 |
| Zero-Shot Video Retrieval | VATEX | InternVideo | text-to-video R@1 | 49.5 | #2 |
| Zero-Shot Video Retrieval | VATEX | InternVideo | video-to-text R@1 | 69.5 | #2 |
| Video Retrieval | VATEX | InternVideo | text-to-video R@1 | 71.1 | #1 |
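
Many of the retrieval results above are reported as text-to-video and video-to-text R@1 (recall at rank 1). As a rough illustration of what that metric measures, the sketch below computes both directions from a square similarity matrix. It is not the paper's evaluation code: the function name `recall_at_1`, the toy scores, and the assumption that text i matches exactly video i are all illustrative. Some benchmarks pair several captions with each video, which a real evaluation would have to handle.

```python
def recall_at_1(similarity):
    """Return (text-to-video R@1, video-to-text R@1) as percentages.

    similarity[i][j] is the score between text i and video j.
    Assumption: the ground-truth match for text i is video i.
    Ties are broken in favor of the lowest index.
    """
    n = len(similarity)
    # Text-to-video: for each text (row), is the top-scoring video the match?
    t2v_hits = sum(
        1 for i in range(n)
        if max(range(n), key=lambda j: similarity[i][j]) == i
    )
    # Video-to-text: for each video (column), is the top-scoring text the match?
    v2t_hits = sum(
        1 for j in range(n)
        if max(range(n), key=lambda i: similarity[i][j]) == j
    )
    return 100.0 * t2v_hits / n, 100.0 * v2t_hits / n


# Toy 3x3 similarity matrix (made-up numbers, not taken from the paper).
toy = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.9],
]
t2v, v2t = recall_at_1(toy)
print(f"text-to-video R@1: {t2v:.1f}, video-to-text R@1: {v2t:.1f}")
```

In the toy matrix, text 1 ranks video 2 above its own video, so text-to-video R@1 is lower than video-to-text R@1. The two directions can differ in the same way in the reported results.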

Methods