HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

antoine77340/milnce_howto100m ICCV 2019

In this work, we propose to learn text-video embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

m-bain/frozen-in-time ICCV 2021

Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval.

MDMMT: Multidomain Multimodal Transformer for Video Retrieval

papermsucode/mdmmt 19 Mar 2021

We present a new state-of-the-art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin.

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

google-research/google-research NeurIPS 2021

We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.

Revitalize Region Feature for Democratizing Video-Language Pre-training

showlab/demovlp 15 Mar 2022

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language tasks.

Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

elad-amrani/ssml 6 Mar 2020

One of the key factors of enabling machine learning models to comprehend and solve real-world tasks is to leverage multimodal data.

Condensed Movies: Story Based Retrieval with Contextual Embeddings

m-bain/CondensedMovies 8 May 2020

Our objective in this work is long range understanding of the narrative structure of movies.

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

albanie/collaborative-experts 3 Aug 2020

This report summarizes the results of the first edition of the challenge together with the findings of the participants.

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

jayleicn/ClipBERT CVPR 2021

Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos. This suggests that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
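
The sparse-sampling idea in the ClipBERT entry can be illustrated with a small sketch. It is not taken from the ClipBERT code: the function names, the uniform random choice of clip start positions, and the mean pooling of per-clip scores are all assumptions made for illustration, and the actual method may sample and aggregate differently.

```python
import random
from statistics import mean


def sample_sparse_clips(num_frames, num_clips, clip_len, seed=None):
    """Pick `num_clips` short clips from a video of `num_frames` frames.

    Hypothetical helper: each clip is `clip_len` consecutive frame indices
    whose start position is drawn uniformly at random (an assumption).
    """
    if clip_len > num_frames:
        raise ValueError("clip_len must not exceed num_frames")
    rng = random.Random(seed)
    last_start = num_frames - clip_len  # latest start that still fits the clip
    starts = sorted(rng.randint(0, last_start) for _ in range(num_clips))
    return [list(range(s, s + clip_len)) for s in starts]


def video_level_score(clip_scores):
    """Combine per-clip scores into one video-level score.

    Mean pooling is used here as one plausible aggregation (an assumption).
    """
    return mean(clip_scores)


# Example: a 300-frame video reduced to 4 clips of 2 frames each.
clips = sample_sparse_clips(num_frames=300, num_clips=4, clip_len=2, seed=0)
frames_used = sum(len(c) for c in clips)  # 8 frames instead of 300
```

Only the sampled frames would be passed through the model, which is what makes end-to-end training affordable compared with extracting features densely from the full-length video.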