End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing the misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on four downstream tasks across eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
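
MIL-NCE combines multiple-instance learning with noise-contrastive estimation: each video clip is scored against a bag of candidate positive narrations (those temporally close to the clip, which may still be misaligned) and this bag is normalized against negative clip-narration pairs. Below is a minimal NumPy sketch of such a loss for a single clip; the function name, array shapes and the way negatives are sampled are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def mil_nce_loss(video_emb, pos_text_emb, neg_text_emb):
    """MIL-NCE objective for a single video clip (illustrative sketch).

    video_emb:    (d,)   embedding of the video clip
    pos_text_emb: (P, d) embeddings of the bag of candidate positive
                  narrations, e.g. the narrations temporally closest
                  to the clip (possibly misaligned, hence the bag)
    neg_text_emb: (N, d) embeddings of negative narrations, e.g. drawn
                  from other clips in the batch
    """
    pos_scores = pos_text_emb @ video_emb                    # (P,)
    all_scores = np.concatenate([pos_scores,
                                 neg_text_emb @ video_emb])  # (P + N,)
    # -log( sum_i exp(pos_i) / sum_j exp(all_j) ), computed stably.
    # The MIL part sums over the whole bag of positives instead of
    # committing to a single aligned pair; the NCE part normalizes
    # against the negatives.
    m = all_scores.max()
    log_num = m + np.log(np.exp(pos_scores - m).sum())
    log_den = m + np.log(np.exp(all_scores - m).sum())
    return log_den - log_num

# Toy usage with random embeddings (shapes only, not real features).
rng = np.random.default_rng(0)
d = 128
loss = mil_nce_loss(rng.normal(size=d),
                    rng.normal(size=(3, d)),
                    rng.normal(size=(64, d)))
print(float(loss))
```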

CVPR 2020

Results from the Paper


Task                                       | Dataset  | Model      | Metric Name             | Metric Value | Global Rank
Action Segmentation                        | COIN     | CBT        | Frame accuracy           | 53.9         | #9
Action Segmentation                        | COIN     | MIL-NCE    | Frame accuracy           | 61.0         | #7
Zero-Shot Video Retrieval                  | MSR-VTT  | MIL-NCE    | text-to-video R@1        | 9.9          | #30
Zero-Shot Video Retrieval                  | MSR-VTT  | MIL-NCE    | text-to-video R@5        | 24.0         | #29
Zero-Shot Video Retrieval                  | MSR-VTT  | MIL-NCE    | text-to-video R@10       | 32.4         | #28
Zero-Shot Video Retrieval                  | MSR-VTT  | MIL-NCE    | text-to-video Mean Rank  | 29.5         | #3
Action Recognition                         | RareAct  | HT100M S3D | mWAP                     | 30.5         | #3
Long Video Retrieval (Background Removed)  | YouCook2 | MIL-NCE    | Cap. Avg. R@1            | 43.1         | #6
Long Video Retrieval (Background Removed)  | YouCook2 | MIL-NCE    | Cap. Avg. R@5            | 68.6         | #6
Long Video Retrieval (Background Removed)  | YouCook2 | MIL-NCE    | Cap. Avg. R@10           | 79.1         | #6
Zero-Shot Video Retrieval                  | YouCook2 | MIL-NCE    | text-to-video R@1        | 15.1         | #5
Zero-Shot Video Retrieval                  | YouCook2 | MIL-NCE    | text-to-video R@5        | 38.0         | #5
Zero-Shot Video Retrieval                  | YouCook2 | MIL-NCE    | text-to-video R@10       | 51.2         | #5
Zero-Shot Video Retrieval                  | YouCook2 | MIL-NCE    | text-to-video Mean Rank  | 10           | #2

Methods


No methods listed for this paper.