| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|---|---|---|---|---|---|
| Zero-Shot Video Question Answer | ActivityNet-QA | VideoChat2 | Confidence Score | 3.3 | # 13 |
| Zero-Shot Video Question Answer | ActivityNet-QA | VideoChat2 | Accuracy | 49.1 | # 14 |
| Video Question Answering | ActivityNet-QA | VideoChat2 | Accuracy | 49.1 | # 8 |
| Video Question Answering | ActivityNet-QA | VideoChat2 | Confidence score | 3.3 | # 2 |
| Zero-Shot Video Question Answer | EgoSchema (fullset) | VideoChat2_mistral | Accuracy | 54.4 | # 11 |
| Zero-Shot Video Question Answer | EgoSchema (fullset) | VideoChat2_HD_mistral | Accuracy | 55.8 | # 10 |
| Zero-Shot Video Question Answer | EgoSchema (fullset) | VideoChat2_phi3 | Accuracy | 56.7 | # 9 |
| Zero-Shot Video Question Answer | EgoSchema (subset) | VideoChat2_mistral | Accuracy | 63.6 | # 6 |
| Zero-Shot Video Question Answer | EgoSchema (subset) | VideoChat2_HD_mistral | Accuracy | 65.6 | # 5 |
| Video Question Answering | IntentQA | VideoChat2_HD_mistral | Accuracy | 83.4 | # 1 |
| Video Question Answering | IntentQA | VideoChat2_HD_mistral | CW | 84.0 | # 1 |
| Video Question Answering | IntentQA | VideoChat2_HD_mistral | CH | 90.0 | # 1 |
| Video Question Answering | IntentQA | VideoChat2_HD_mistral | TP&TN | 77.3 | # 2 |
| Video Question Answering | IntentQA | VideoChat2_mistral | Accuracy | 81.9 | # 2 |
| Video Question Answering | IntentQA | VideoChat2_mistral | CW | 82.6 | # 2 |
| Video Question Answering | IntentQA | VideoChat2_mistral | CH | 86.9 | # 2 |
| Video Question Answering | IntentQA | VideoChat2_mistral | TP&TN | 77.0 | # 3 |
| Zero-Shot Video Question Answer | MSRVTT-QA | VideoChat2 | Accuracy | 54.1 | # 23 |
| Zero-Shot Video Question Answer | MSRVTT-QA | VideoChat2 | Confidence Score | 3.3 | # 14 |
| Zero-Shot Video Question Answer | MSVD-QA | VideoChat2 | Accuracy | 70.0 | # 18 |
| Zero-Shot Video Question Answer | MSVD-QA | VideoChat2 | Confidence Score | 3.9 | # 9 |
| Video Question Answering | MVBench | VideoChat2 | Avg. | 51.9 | # 12 |
| Video Question Answering | NExT-QA | VideoChat2_mistral | Accuracy | 78.6 | # 6 |
| Video Question Answering | NExT-QA | VideoChat2 | Accuracy | 68.6 | # 19 |
| Zero-Shot Video Question Answer | NExT-QA | VideoChat2 | Accuracy | 61.7 | # 18 |
| Video Question Answering | NExT-QA | VideoChat2_HD_mistral | Accuracy | 79.5 | # 3 |
| Zero-Shot Video Question Answer | STAR Benchmark | VideoChat2 | Accuracy | 59.0 | # 1 |
| Video Question Answering | TVBench | VideoChat2 | Average Accuracy | 35.0 | # 20 |
| Zero-Shot Video Question Answer | TVQA | VideoChat_mistral (no speech) | Accuracy | 46.4 | # 5 |
| Zero-Shot Video Question Answer | TVQA | VideoChat_HD_mistral (no speech) | Accuracy | 50.6 | # 4 |
| Zero-Shot Video Question Answer | TVQA | VideoChat2 (no speech) | Accuracy | 40.6 | # 6 |
| Zero-Shot Learning | TVQA | VideoChat2 | Accuracy | 40.6 | # 1 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | VideoChat2_HD_mistral | gpt-score | 3.40 | # 5 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | VideoChat2_HD_mistral | gpt-score | 2.86 | # 12 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Correctness of Information | 3.02 | # 11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Detail Orientation | 2.88 | # 14 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Contextual Understanding | 3.51 | # 11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Temporal Understanding | 2.66 | # 10 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Consistency | 2.81 | # 10 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | mean | 2.98 | # 15 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | VideoChat2_HD_mistral | gpt-score | 2.65 | # 8 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | VideoChat2_HD_mistral | gpt-score | 3.64 | # 7 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | VideoChat2_HD_mistral | gpt-score | 2.62 | # 10 |
| VCGBench-Diverse | VideoInstruct | VideoChat2 | mean | 2.20 | # 3 |
| VCGBench-Diverse | VideoInstruct | VideoChat2 | Correctness of Information | 2.13 | # 5 |
| VCGBench-Diverse | VideoInstruct | VideoChat2 | Detail Orientation | 2.42 | # 4 |
| VCGBench-Diverse | VideoInstruct | VideoChat2 | Contextual Understanding | 2.51 | # 4 |
| VCGBench-Diverse | VideoInstruct | VideoChat2 | Temporal Understanding | 1.66 | # 2 |
| VCGBench-Diverse | VideoInstruct | VideoChat2 | Consistency | 2.27 | # 4 |
| VCGBench-Diverse | VideoInstruct | VideoChat2 | Dense Captioning | 1.26 | # 3 |
| VCGBench-Diverse | VideoInstruct | VideoChat2 | Spatial Understanding | 2.43 | # 2 |
| VCGBench-Diverse | VideoInstruct | VideoChat2 | Reasoning | 3.13 | # 6 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | VideoChat2 | gpt-score | 2.81 | # 6 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | VideoChat2 | gpt-score | 2.66 | # 7 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | VideoChat2 | gpt-score | 3.51 | # 9 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | VideoChat2 | gpt-score | 2.88 | # 11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2_HD_mistral | Correctness of Information | 3.40 | # 4 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2_HD_mistral | Detail Orientation | 2.91 | # 12 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2_HD_mistral | Contextual Understanding | 3.72 | # 7 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2_HD_mistral | Temporal Understanding | 2.65 | # 11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2_HD_mistral | Consistency | 2.84 | # 9 |
| Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2_HD_mistral | mean | 3.10 | # 10 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | VideoChat2 | gpt-score | 3.02 | # 9 |