TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Temporal Action Localization	CrossTask	Text-Video Embedding	Recall	33.6	# 4
Video Retrieval	LSMDC	Text-Video Embedding	text-to-video R@1	7.2	# 36
Video Retrieval	LSMDC	Text-Video Embedding	text-to-video R@5	19.6	# 31
Video Retrieval	LSMDC	Text-Video Embedding	text-to-video R@10	27.9	# 30
Video Retrieval	LSMDC	Text-Video Embedding	text-to-video Median Rank	40	# 19
Video Retrieval	MSR-VTT	Text-Video Embedding	text-to-video R@1	14.9	# 33
Video Retrieval	MSR-VTT	Text-Video Embedding	text-to-video R@10	52.8	# 28
Video Retrieval	MSR-VTT	Text-Video Embedding	text-to-video Median Rank	9	# 12
Video Retrieval	MSR-VTT	Text-Video Embedding	video-to-text R@5	40.2	# 10
Video Retrieval	MSR-VTT-1kA	HT	text-to-video R@1	12.1	# 55
Video Retrieval	MSR-VTT-1kA	HT	text-to-video R@5	35.0	# 54
Video Retrieval	MSR-VTT-1kA	HT	text-to-video R@10	48.0	# 57
Video Retrieval	MSR-VTT-1kA	HT	text-to-video Median Rank	12	# 37
Video Retrieval	MSR-VTT-1kA	HT-Pretrained	text-to-video R@1	14.9	# 54
Video Retrieval	MSR-VTT-1kA	HT-Pretrained	text-to-video R@5	40.2	# 53
Video Retrieval	MSR-VTT-1kA	HT-Pretrained	text-to-video R@10	52.8	# 56
Video Retrieval	MSR-VTT-1kA	HT-Pretrained	text-to-video Median Rank	9	# 36
Video Retrieval	YouCook2	Text-Video Embedding	text-to-video R@1	8.2	# 11
Video Retrieval	YouCook2	Text-Video Embedding	text-to-video R@5	24.5	# 10
Video Retrieval	YouCook2	Text-Video Embedding	text-to-video R@10	35.3	# 13
Video Retrieval	YouCook2	Text-Video Embedding	text-to-video Median Rank	24	# 7
Long Video Retrieval (Background Removed)	YouCook2	Text-Video Embedding	Cap. Avg. R@1	46.6	# 5
Long Video Retrieval (Background Removed)	YouCook2	Text-Video Embedding	Cap. Avg. R@5	74.3	# 5
Long Video Retrieval (Background Removed)	YouCook2	Text-Video Embedding	Cap. Avg. R@10	83.7	# 4
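
For readers unfamiliar with the text-to-video R@K and Median Rank columns above, the following is a minimal sketch of how such retrieval metrics are commonly computed from a query-by-video similarity matrix. It is not the paper's evaluation code. The function names and the toy matrix are hypothetical, and it assumes that the correct video for query i sits at index i. The CrossTask Recall and "Cap. Avg." rows use different protocols and are not covered here.

```python
from statistics import median

def ground_truth_ranks(sim):
    """Return the 1-based rank of the matching video for each text query.

    sim[i][j] is the similarity of query i to video j. Assumption: the
    correct video for query i is the one at index i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scored strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    return ranks

def recall_at_k(ranks, k):
    """Percentage of queries whose matching video is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank(ranks):
    """Median position of the matching video (lower is better)."""
    return median(ranks)

# Toy similarity matrix: 3 text queries x 3 videos (made-up values).
sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.5],
    [0.2, 0.4, 0.6],
]
ranks = ground_truth_ranks(sim)
print(ranks, recall_at_k(ranks, 1), median_rank(ranks))
```

Higher is better for R@K, while lower is better for Median Rank, which is why the two kinds of column cannot be compared directly.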

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/howto100m-learning-a-text-video-embedding-by/temporal-action-localization-on-crosstask)](https://paperswithcode.com/sota/temporal-action-localization-on-crosstask?p=howto100m-learning-a-text-video-embedding-by)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/howto100m-learning-a-text-video-embedding-by/long-video-retrieval-background-removed-on)](https://paperswithcode.com/sota/long-video-retrieval-background-removed-on?p=howto100m-learning-a-text-video-embedding-by)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/howto100m-learning-a-text-video-embedding-by/video-retrieval-on-youcook2)](https://paperswithcode.com/sota/video-retrieval-on-youcook2?p=howto100m-learning-a-text-video-embedding-by)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/howto100m-learning-a-text-video-embedding-by/video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=howto100m-learning-a-text-video-embedding-by)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/howto100m-learning-a-text-video-embedding-by/video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/video-retrieval-on-lsmdc?p=howto100m-learning-a-text-video-embedding-by)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/howto100m-learning-a-text-video-embedding-by/video-retrieval-on-msr-vtt-1ka)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka?p=howto100m-learning-a-text-video-embedding-by)`

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

ICCV 2019 · Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.
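
To make the idea of a joint text-video embedding concrete, here is a minimal sketch of text-to-video retrieval once both modalities live in the same vector space. It is an illustration under stated assumptions, not the authors' model: the three-dimensional vectors, the clip names, the `retrieve` helper, and the choice of cosine similarity as the scoring function are all hypothetical, and a real system would obtain the vectors from trained text and video encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(text_vec, clip_vecs, top_k=5):
    """Return up to top_k (clip_id, score) pairs, best match first."""
    scored = [(clip_id, cosine(text_vec, vec)) for clip_id, vec in clip_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical clip embeddings (made-up values standing in for encoder outputs).
clips = {
    "whisk_eggs": [0.9, 0.1, 0.0],
    "chop_onion": [0.1, 0.9, 0.2],
    "pour_milk": [0.6, 0.2, 0.7],
}
# Hypothetical embedding of a text query such as "beat the eggs".
query = [1.0, 0.0, 0.1]

results = retrieve(query, clips, top_k=2)
for clip_id, score in results:
    print(clip_id, round(score, 3))
```

Ranking every clip against a query in this way produces the ordered lists from which retrieval metrics such as R@K and Median Rank are derived.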

Code

antoine77340/MIL-NCE_HowTo100M

207

antoine77340/milnce_howto100m

207

antoine77340/S3D_HowTo100M

184

roudimit/AVLnet

Tasks

Action Localization

Long Video Retrieval (Background Removed)

Retrieval

Text to Video Retrieval

Video Retrieval

Datasets

Introduced in the Paper:

HowTo100M

Used in the Paper:

MSR-VTT

DiDeMo

YouCook2

LSMDC

CrossTask

Results from the Paper

Ranked #4 on Temporal Action Localization on CrossTask

