HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

ICCV 2019 · Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale...


Results from the Paper


TASK                          DATASET    MODEL                 METRIC NAME                METRIC VALUE  GLOBAL RANK
Temporal Action Localization  CrossTask  Text-Video Embedding  Recall                     33.6          # 1
Video Retrieval               LSMDC      Text-Video Embedding  text-to-video R@1          7.2           # 5
Video Retrieval               LSMDC      Text-Video Embedding  text-to-video R@5          19.6          # 4
Video Retrieval               LSMDC      Text-Video Embedding  text-to-video R@10         27.9          # 4
Video Retrieval               LSMDC      Text-Video Embedding  text-to-video Median Rank  40            # 4
Video Retrieval               MSR-VTT    Text-Video Embedding  text-to-video R@1          14.9          # 1
Video Retrieval               MSR-VTT    Text-Video Embedding  text-to-video R@10         52.8          # 1
Video Retrieval               MSR-VTT    Text-Video Embedding  text-to-video Median Rank  9             # 1
Video Retrieval               MSR-VTT    Text-Video Embedding  video-to-text R@5          40.2          # 2
Video Retrieval               YouCook2   Text-Video Embedding  text-to-video Median Rank  24            # 1
Video Retrieval               YouCook2   Text-Video Embedding  text-to-video R@1          8.2           # 1
Video Retrieval               YouCook2   Text-Video Embedding  text-to-video R@10         35.3          # 1
Video Retrieval               YouCook2   Text-Video Embedding  text-to-video R@5          24.5          # 1
