InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges

In this report, we present our champion solutions to five tracks of the Ego4D challenge. We apply InternVideo, our video foundation model, to five Ego4D tasks: Moment Queries, Natural Language Queries, Future Hand Prediction, State Change Object Detection, and Short-term Object Interaction Anticipation. InternVideo-Ego4D is an effective paradigm for adapting a strong foundation model to downstream egocentric video understanding tasks with simple head designs. On all five tasks, InternVideo-Ego4D surpasses both the baseline methods and the CVPR 2022 champions, demonstrating the strong representation ability of InternVideo as a video foundation model. Our code will be released at https://github.com/OpenGVLab/ego4d-eccv2022-solutions

Datasets


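| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Moment Queries | Ego4D | InternVideo | Avg mAP (0.1-0.5) | 23.59 | #2 |
| Moment Queries | Ego4D | InternVideo | Recall | 41.13 | #3 |
| State Change Object Detection | Ego4D | InternVideo | AP | 37.19 | #1 |
| State Change Object Detection | Ego4D | InternVideo | AP50 | 55.97 | #1 |
| State Change Object Detection | Ego4D | InternVideo | AP75 | 38.44 | #1 |
| Short-term Object Interaction Anticipation | Ego4D | InternVideo | Overall (Top5 mAP) | 3.4 | #2 |
| Short-term Object Interaction Anticipation | Ego4D | InternVideo | Noun (Top5 mAP) | 24.6 | #1 |
| Short-term Object Interaction Anticipation | Ego4D | InternVideo | Noun+Verb (Top5 mAP) | 9.18 | #2 |
| Short-term Object Interaction Anticipation | Ego4D | InternVideo | Noun+TTC (Top5 mAP) | 7.64 | #1 |
| Future Hand Prediction | Ego4D | InternVideo | M.Disp (Left) | 43.25 | #1 |
| Future Hand Prediction | Ego4D | InternVideo | C.Disp (Left) | 53.33 | #1 |
| Future Hand Prediction | Ego4D | InternVideo | M.Disp (Right) | 46.25 | #1 |
| Future Hand Prediction | Ego4D | InternVideo | C.Disp (Right) | 53.37 | #1 |
| Future Hand Prediction | Ego4D | InternVideo | Disp (Total) | 196.8 | #1 |
| Natural Language Queries | Ego4D | InternVideo | R@1 IoU=0.3 | 16.45 | #1 |
| Natural Language Queries | Ego4D | InternVideo | R@5 IoU=0.3 | 22.95 | #3 |
| Natural Language Queries | Ego4D | InternVideo | R@1 IoU=0.5 | 10.06 | #1 |
| Natural Language Queries | Ego4D | InternVideo | R@5 IoU=0.5 | 16.10 | #3 |
| Natural Language Queries | Ego4D | InternVideo | R@1 Mean (0.3 and 0.5) | 13.26 | #1 |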
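
For readers unfamiliar with the Natural Language Queries metrics above, the following is a minimal sketch of how R@k at a temporal IoU threshold is commonly computed in temporal grounding. It assumes the usual definition (a query counts as a hit if any of its top-k predicted segments overlaps the ground-truth segment with IoU at or above the threshold); it is not taken from the authors' code or the official evaluation script. The function names and the toy segments are hypothetical. The "R@1 Mean" row is consistent with averaging the two R@1 rows: (16.45 + 10.06) / 2 = 13.255, reported as 13.26.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds; illustrative type alias


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    predictions: Sequence[Sequence[Segment]],
    ground_truths: Sequence[Segment],
    k: int,
    iou_threshold: float,
) -> float:
    """Percentage of queries with at least one top-k prediction above the IoU threshold.

    predictions[i] is assumed to be sorted by descending confidence for query i.
    """
    if len(ground_truths) == 0:
        raise ValueError("need at least one query")
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(ground_truths)


# Toy data (hypothetical, two queries), not results from the report.
toy_predictions = [[(0.0, 10.0), (4.0, 14.0)], [(20.0, 30.0)]]
toy_ground_truths = [(5.0, 15.0), (0.0, 5.0)]

r1_at_03 = recall_at_k(toy_predictions, toy_ground_truths, k=1, iou_threshold=0.3)
r1_at_05 = recall_at_k(toy_predictions, toy_ground_truths, k=1, iou_threshold=0.5)
r5_at_05 = recall_at_k(toy_predictions, toy_ground_truths, k=5, iou_threshold=0.5)
r1_mean = (r1_at_03 + r1_at_05) / 2
```

On the toy data, the first query is a hit at IoU=0.3 with its top-1 prediction but needs its second prediction to clear IoU=0.5, which is why R@5 can exceed R@1 at the same threshold.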

Methods