| Task | Dataset | Model | Metric name | Metric value | Global rank |
|---|---|---|---|---|---|
| Action Recognition | ActivityNet | InternVideo2-6B | mAP | 95.9 | #3 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-1B | text-to-video R@1 | 60.4 | #2 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-1B | video-to-text R@1 | 54.8 | #2 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-1B | text-to-video R@10 | 90.8 | #3 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-1B | text-to-video R@5 | 83.9 | #2 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-1B | video-to-text R@5 | 81.5 | #2 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-1B | video-to-text R@10 | 89.5 | #2 |
| Video Retrieval | ActivityNet | InternVideo2-6B | text-to-video R@1 | 74.1 | #1 |
| Video Retrieval | ActivityNet | InternVideo2-6B | video-to-text R@1 | 69.7 | #1 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-6B | text-to-video R@1 | 63.2 | #1 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-6B | video-to-text R@1 | 56.5 | #1 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-6B | text-to-video R@10 | 92.5 | #1 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-6B | text-to-video R@5 | 85.6 | #1 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-6B | video-to-text R@5 | 82.8 | #1 |
| Zero-Shot Video Retrieval | ActivityNet | InternVideo2-6B | video-to-text R@10 | 90.3 | #1 |
| Temporal Action Localization | ActivityNet-1.3 | InternVideo2-1B | mAP | 40.4 | #6 |
| Temporal Action Localization | ActivityNet-1.3 | InternVideo2-6B | mAP | 41.2 | #5 |
| Text to Audio Retrieval | AudioCaps | InternVideo2-6B | R@1 | 55.2 | #1 |
| Zero-shot Text to Audio Retrieval | AudioCaps | InternVideo2-6B | Audio-to-text R@1 | 37.1 | #1 |
| Moment Retrieval | Charades-STA | InternVideo2-1B | R@1 IoU=0.5 | 68.36 | #6 |
| Moment Retrieval | Charades-STA | InternVideo2-1B | R@1 IoU=0.7 | 45.03 | #6 |
| Moment Retrieval | Charades-STA | InternVideo2-6B | R@1 IoU=0.5 | 70.03 | #5 |
| Moment Retrieval | Charades-STA | InternVideo2-6B | R@1 IoU=0.7 | 48.95 | #5 |
| Text to Audio Retrieval | Clotho | InternVideo2-6B | R@1 | 27.2 | #2 |
| Zero-shot Text to Audio Retrieval | Clotho | InternVideo2-6B | text-to-audio R@1 | 17.4 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-1B | text-to-video R@1 | 57.0 | #2 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-1B | text-to-video R@5 | 80.0 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-1B | text-to-video R@10 | 85.1 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-1B | video-to-text R@1 | 54.3 | #2 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-1B | video-to-text R@5 | 77.2 | #2 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-1B | video-to-text R@10 | 83.5 | #3 |
| Video Retrieval | DiDeMo | InternVideo2-6B | text-to-video R@1 | 74.2 | #1 |
| Video Retrieval | DiDeMo | InternVideo2-6B | video-to-text R@1 | 71.9 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-6B | text-to-video R@1 | 57.9 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-6B | text-to-video R@5 | 80.0 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-6B | text-to-video R@10 | 84.6 | #2 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-6B | video-to-text R@1 | 57.1 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-6B | video-to-text R@5 | 79.9 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | InternVideo2-6B | video-to-text R@10 | 85.0 | #1 |
| Zero-Shot Video Question Answer | EgoSchema (fullset) | InternVideo2-6B | Accuracy | 60.2 | #10 |
| Audio Classification | ESC-50 | InternVideo2 | Top-1 Accuracy | 98.6 | #2 |
| Audio Classification | ESC-50 | InternVideo2 | Pre-training dataset | Multiple | #1 |
| Audio Classification | ESC-50 | InternVideo2 | Accuracy (5-fold) | 98.6 | #2 |
| Temporal Action Localization | FineAction | InternVideo2-6B | mAP | 27.7 | #3 |
| Temporal Action Localization | HACS | InternVideo2-6B | Average-mAP | 43.3 | #4 |
| Action Recognition | HACS | InternVideo2-6B | Top 1 Accuracy | 97.0 | #1 |
| Temporal Action Localization | HACS | InternVideo2-1B | Average-mAP | 42.4 | #6 |
| Action Classification | Kinetics-400 | InternVideo2-1B | Acc@1 | 91.6 | #4 |
| Action Classification | Kinetics-400 | InternVideo2-6B | Acc@1 | 92.1 | #3 |
| Action Classification | Kinetics-600 | InternVideo2-6B | Top-1 Accuracy | 91.9 | #1 |
| Action Classification | Kinetics-600 | InternVideo2-1B | Top-1 Accuracy | 91.6 | #3 |
| Action Classification | Kinetics-700 | InternVideo2-1B | Top-1 Accuracy | 85.4 | #2 |
| Action Classification | Kinetics-700 | InternVideo2-6B | Top-1 Accuracy | 85.9 | #1 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-6B | text-to-video R@1 | 33.8 | #1 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-6B | video-to-text R@1 | 30.1 | #1 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-6B | text-to-video R@5 | 55.9 | #1 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-6B | text-to-video R@10 | 62.2 | #1 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-6B | video-to-text R@5 | 47.7 | #1 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-6B | video-to-text R@10 | 54.8 | #1 |
| Video Retrieval | LSMDC | InternVideo2-6B | text-to-video R@1 | 46.4 | #1 |
| Video Retrieval | LSMDC | InternVideo2-6B | video-to-text R@1 | 46.7 | #1 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-1B | text-to-video R@1 | 32.0 | #2 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-1B | video-to-text R@1 | 27.3 | #2 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-1B | text-to-video R@5 | 52.4 | #2 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-1B | text-to-video R@10 | 59.4 | #2 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-1B | video-to-text R@5 | 44.2 | #2 |
| Zero-Shot Video Retrieval | LSMDC | InternVideo2-1B | video-to-text R@10 | 51.6 | #2 |
| Action Classification | MiT | InternVideo2-1B | Top 1 Accuracy | 50.9 | #2 |
| Action Classification | MiT | InternVideo2-6B | Top 1 Accuracy | 51.2 | #1 |
| Video Retrieval | MSR-VTT | InternVideo2-6B | text-to-video R@1 | 62.8 | #3 |
| Video Retrieval | MSR-VTT | InternVideo2-6B | video-to-text R@1 | 60.2 | #3 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-1B | text-to-video R@1 | 51.9 | #3 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-1B | text-to-video R@5 | 75.3 | #2 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-1B | text-to-video R@10 | 82.5 | #3 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-1B | video-to-text R@1 | 50.9 | #3 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-1B | video-to-text R@5 | 73.4 | #3 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-1B | video-to-text R@10 | 81.8 | #4 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-6B | text-to-video R@1 | 55.9 | #1 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-6B | text-to-video R@5 | 78.3 | #1 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-6B | text-to-video R@10 | 85.1 | #1 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-6B | video-to-text R@1 | 53.7 | #1 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-6B | video-to-text R@5 | 77.5 | #1 |
| Zero-Shot Video Retrieval | MSR-VTT | InternVideo2-6B | video-to-text R@10 | 84.1 | #1 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-6B | text-to-video R@1 | 59.3 | #1 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-6B | video-to-text R@1 | 83.1 | #2 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-6B | text-to-video R@5 | 84.4 | #1 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-6B | text-to-video R@10 | 89.6 | #1 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-6B | video-to-text R@5 | 94.2 | #2 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-6B | video-to-text R@10 | 97.0 | #2 |
| Video Retrieval | MSVD | InternVideo2-6B | text-to-video R@1 | 61.4 | #1 |
| Video Retrieval | MSVD | InternVideo2-6B | video-to-text R@1 | 85.2 | #1 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-1B | text-to-video R@1 | 58.1 | #2 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-1B | video-to-text R@1 | 83.3 | #1 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-1B | text-to-video R@5 | 83.0 | #2 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-1B | text-to-video R@10 | 88.4 | #2 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-1B | video-to-text R@5 | 94.3 | #1 |
| Zero-Shot Video Retrieval | MSVD | InternVideo2-1B | video-to-text R@10 | 96.9 | #3 |
| Zero-Shot Video Question Answer | MVBench | InternVideo2-1B | Accuracy | 60.9 | #2 |
| Video Question Answering | MVBench | InternVideo2 | Avg. | 67.2 | #3 |
| Video Question Answering | Perception Test | InternVideo2 (8B) | Accuracy (Top-1) | 63.4 | #3 |
| Moment Retrieval | QVHighlights | InternVideo2-6B | mAP | 49.24 | #5 |
| Moment Retrieval | QVHighlights | InternVideo2-6B | R@1 IoU=0.5 | 71.42 | #4 |
| Moment Retrieval | QVHighlights | InternVideo2-6B | R@1 IoU=0.7 | 56.45 | #4 |
| Video Grounding | QVHighlights | InternVideo2-6B | R@1,IoU=0.5 | 71.42 | #1 |
| Video Grounding | QVHighlights | InternVideo2-6B | R@1,IoU=0.7 | 56.45 | #1 |
| Video Grounding | QVHighlights | InternVideo2-1B | R@1,IoU=0.5 | 70.00 | #2 |
| Video Grounding | QVHighlights | InternVideo2-1B | R@1,IoU=0.7 | 54.45 | #2 |
| Action Recognition | Something-Something V2 | InternVideo2-6B | Top-1 Accuracy | 1 | #118 |
| Action Recognition | Something-Something V2 | InternVideo2-6B | Top-5 Accuracy | 12 | #87 |
| Action Recognition | Something-Something V2 | InternVideo2-6B | Parameters | 2131 | #12 |
| Action Recognition | Something-Something V2 | InternVideo2-6B | GFLOPs | 13321 | #1 |
| Action Recognition | Something-Something V2 | InternVideo2-1B | Top-1 Accuracy | 77.1 | #3 |
| Temporal Action Localization | THUMOS’14 | InternVideo2-6B | Avg mAP (0.3:0.7) | 72.0 | #4 |
| Temporal Action Localization | THUMOS’14 | InternVideo2-1B | Avg mAP (0.3:0.7) | 69.8 | #7 |
| Video Retrieval | VATEX | InternVideo2-6B | text-to-video R@1 | 75.5 | #4 |
| Video Retrieval | VATEX | InternVideo2-6B | video-to-text R@1 | 89.3 | #1 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-1B | text-to-video R@1 | 70.4 | #3 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-1B | video-to-text R@1 | 85.4 | #1 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-1B | text-to-video R@5 | 93.4 | #2 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-1B | text-to-video R@10 | 96.9 | #3 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-1B | video-to-text R@5 | 97.6 | #2 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-1B | video-to-text R@10 | 99.1 | #2 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-6B | text-to-video R@1 | 71.5 | #2 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-6B | video-to-text R@1 | 85.3 | #2 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-6B | text-to-video R@5 | 94.0 | #1 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-6B | text-to-video R@10 | 97.1 | #2 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-6B | video-to-text R@5 | 97.9 | #1 |
| Zero-Shot Video Retrieval | VATEX | InternVideo2-6B | video-to-text R@10 | 99.3 | #1 |