InternVideo: General Video Foundation Models via Generative and Discriminative Learning

6 Dec 2022 · Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao

The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.
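
The abstract names video-language contrastive learning as one of the two pretraining objectives but does not spell out the loss on this page. As a rough illustration only, the sketch below shows a symmetric InfoNCE-style loss, a common form for this kind of objective. The function names, the temperature of 0.07, and the assumption that `video_emb[i]` and `text_emb[i]` form a matched pair are all illustrative choices, not details taken from the paper.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Assumption: video_emb[i] and text_emb[i] are a matched pair, and every
    other pairing in the batch is treated as a negative.
    The temperature default is a hypothetical value, not from the paper.
    """
    v = [l2_normalize(e) for e in video_emb]
    t = [l2_normalize(e) for e in text_emb]
    n = len(v)
    # Cosine similarity of every video against every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Video-to-text direction: each row should put its mass on the diagonal.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is near zero when matched pairs are the most similar entries in the batch and grows large when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.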

Code

opengvlab/internvideo (official, 921 GitHub stars)

Tasks

Action Classification

Action Recognition

Contrastive Learning

Open Set Action Recognition

Spatio-Temporal Action Localization

Temporal Action Localization

Video Question Answering

Video Retrieval

Video Understanding

Visual Question Answering (VQA)

Zero-Shot Video Question Answer

Zero-Shot Video Retrieval

Datasets

UCF101

Kinetics

HMDB51

ActivityNet

Kinetics 400

MSR-VTT

THUMOS14

MSVD

Something-Something V2

HowTo100M

DiDeMo

WebVid

Kinetics-600

TVQA

Something-Something V1

LSMDC

VATEX

AVA

Kinetics-700

TGIF-QA

NExT-QA

HACS

MSRVTT-QA

MSVD-QA

TGIF

VLN-CE

EgoSchema

FineAction

STAR Benchmark

Results from the Paper

Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Zero-Shot Video Retrieval	ActivityNet	InternVideo	text-to-video R@1	30.7	# 10
Zero-Shot Video Retrieval	ActivityNet	InternVideo	video-to-text R@1	31.4	# 8
Video Retrieval	ActivityNet	InternVideo	text-to-video R@1	62.2	# 7
Video Retrieval	ActivityNet	InternVideo	video-to-text R@1	62.8	# 4
Temporal Action Localization	ActivityNet-1.3	InternVideo	mAP	39.00	# 8
Spatio-Temporal Action Localization	AVA-Kinetics	InternVideo	val mAP	41.01	# 3
Action Recognition	AVA v2.2	InternVideo	mAP	41.01	# 6
Zero-Shot Video Retrieval	DiDeMo	InternVideo	text-to-video R@1	31.5	# 15
Zero-Shot Video Retrieval	DiDeMo	InternVideo	text-to-video R@5	57.6	# 15
Zero-Shot Video Retrieval	DiDeMo	InternVideo	text-to-video R@10	68.2	# 15
Zero-Shot Video Retrieval	DiDeMo	InternVideo	video-to-text R@1	33.5	# 7
Zero-Shot Video Retrieval	DiDeMo	InternVideo	video-to-text R@5	60.3	# 7
Zero-Shot Video Retrieval	DiDeMo	InternVideo	video-to-text R@10	71.1	# 7
Video Retrieval	DiDeMo	InternVideo	text-to-video R@1	57.9	# 9
Video Retrieval	DiDeMo	InternVideo	video-to-text R@1	59.1	# 4
Zero-Shot Video Question Answer	EgoSchema (fullset)	InternVideo	Accuracy	32.1	# 7
Temporal Action Localization	FineAction	InternVideo	mAP	17.57	# 4
Temporal Action Localization	HACS	InternVideo	Average-mAP	41.55	# 5
Action Classification	Kinetics-400	InternVideo	Acc@1	91.1	# 3
Action Classification	Kinetics-600	InternVideo-T	Top-1 Accuracy	91.3	# 5
Action Classification	Kinetics-700	InternVideo-T	Top-1 Accuracy	84.0	# 3
Video Retrieval	LSMDC	InternVideo	text-to-video R@1	34.0	# 8
Video Retrieval	LSMDC	InternVideo	video-to-text R@1	34.9	# 4
Zero-Shot Video Retrieval	LSMDC	InternVideo	text-to-video R@1	17.6	# 7
Zero-Shot Video Retrieval	LSMDC	InternVideo	video-to-text R@1	13.2	# 4
Zero-Shot Video Retrieval	LSMDC	InternVideo	text-to-video R@5	32.4	# 7
Zero-Shot Video Retrieval	LSMDC	InternVideo	text-to-video R@10	40.2	# 7
Zero-Shot Video Retrieval	LSMDC	InternVideo	video-to-text R@5	27.8	# 4
Zero-Shot Video Retrieval	LSMDC	InternVideo	video-to-text R@10	34.9	# 4
Zero-Shot Video Retrieval	MSR-VTT	InternVideo	text-to-video R@1	40.7	# 10
Zero-Shot Video Retrieval	MSR-VTT	InternVideo	video-to-text R@1	39.6	# 4
Video Retrieval	MSR-VTT	InternVideo	text-to-video R@1	55.2	# 7
Video Retrieval	MSR-VTT	InternVideo	video-to-text R@1	57.9	# 6
Visual Question Answering (VQA)	MSRVTT-QA	InternVideo	Accuracy	0.471	# 6
Zero-Shot Video Retrieval	MSVD	InternVideo	text-to-video R@1	43.4	# 9
Zero-Shot Video Retrieval	MSVD	InternVideo	video-to-text R@1	67.6	# 7
Video Retrieval	MSVD	InternVideo	text-to-video R@1	58.4	# 3
Video Retrieval	MSVD	InternVideo	video-to-text R@1	76.3	# 3
Visual Question Answering (VQA)	MSVD-QA	InternVideo	Accuracy	0.555	# 12
Zero-Shot Video Question Answer	NExT-QA	InternVideo	Accuracy	49.1	# 13
Action Recognition	Something-Something V1	InternVideo	Top 1 Accuracy	70.0	# 1
Action Recognition	Something-Something V2	InternVideo	Top-1 Accuracy	77.2	# 3
Zero-Shot Video Question Answer	STAR Benchmark	InternVideo	Accuracy	41.6	# 4
Zero-Shot Video Question Answer	STAR Benchmark	InternVideo	Accuracy	41.6	# 3
Video Question Answering	STAR Benchmark	InternVideo	Average Accuracy	58.7	# 4
Visual Question Answering (VQA)	TGIF-QA	InternVideo	Accuracy	0.722	# 2
Temporal Action Localization	THUMOS’14	ActionFormer (InternVideo features)	Avg mAP (0.3:0.7)	71.58	# 4
Zero-Shot Video Question Answer	TVQA	InternVideo	Accuracy	35.9	# 5
Open Set Action Recognition	UCF101-MiTv2	InternVideo	AUROC	91.85	# 1
Open Set Action Recognition	UCF-HMDB	InternVideo	AUROC	85.48	# 1
Video Retrieval	VATEX	InternVideo	text-to-video R@1	71.1	# 5
Video Retrieval	VATEX	InternVideo	video-to-text R@1	87.2	# 2
Zero-Shot Video Retrieval	VATEX	InternVideo	text-to-video R@1	49.5	# 4
Zero-Shot Video Retrieval	VATEX	InternVideo	video-to-text R@1	69.5	# 4
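
Many rows above report retrieval quality as R@1, R@5, or R@10 in both the text-to-video and video-to-text directions. For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@K is commonly computed from a query-by-candidate similarity matrix. It assumes a one-to-one pairing where the correct candidate for query `i` is candidate `i`. Benchmarks with several captions per video need a different ground-truth mapping, and the function name `recall_at_k` is hypothetical rather than taken from the InternVideo code.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose true match ranks in the top k.

    sim[i][j] is the similarity of query i to candidate j.
    Assumption: the ground-truth candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the true match: how many candidates score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Rows are text queries, columns are videos (text-to-video direction).
text_to_video = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.4, 0.3],
]
# Transposing the matrix gives the video-to-text direction.
video_to_text = [list(col) for col in zip(*text_to_video)]

t2v_r1 = recall_at_k(text_to_video, 1)
t2v_r2 = recall_at_k(text_to_video, 2)
v2t_r1 = recall_at_k(video_to_text, 1)
print(round(t2v_r1, 1), round(t2v_r2, 1), round(v2t_r1, 1))
```

Because the two directions rank along different axes of the same matrix, text-to-video and video-to-text scores can differ noticeably, as several dataset rows in the table show.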

Methods

Contrastive Learning • InternVideo
