Video-Text Retrieval

47 papers with code • 1 benchmarks • 5 datasets

Video-Text retrieval requires understanding of both video and language together. Therefore it's different to video retrieval task.

Benchmarks

Add a Result

These leaderboards are used to track progress in Video-Text Retrieval

Trend	Dataset	Best Model	Paper	Code	Compare
	Test-of-Time	TACT			See all

Libraries

Use these libraries to find Video-Text Retrieval models and implementations

towhee-io/towhee

3 papers

2,991

Datasets

Most implemented papers

Most implemented Social Latest No code

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

m-bain/frozen-in-time • • ICCV 2021

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.

Paper
Code

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

ArrowLuo/CLIP4Clip • • 18 Apr 2021

In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.

Paper
Code

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

CryhanFang/CLIP2Video • • CVPR 2020

To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.

Paper
Code

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

pku-yuangroup/languagebind • • 3 Oct 2023

We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M.

Paper
Code

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

alibaba/AliceMind • • 24 May 2022

Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks.

Paper
Code

Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

starmemda/camow • • 9 Sep 2021

In this paper, we propose a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to solve the two heterogeneity.

Paper
Code

Bridging Video-text Retrieval with Multiple Choice Questions

tencentarc/mcq • • CVPR 2022

As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e. g., action recognition with linear evaluation.

Paper
Code

Egocentric Video-Language Pretraining

showlab/egovlp • • 3 Jun 2022

Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention.

Paper
Code

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

rerv/uniadapter • • 13 Feb 2023

Particularly, on the MSRVTT retrieval task, UniAdapter achieves 49. 7% recall@1 with 2. 2% model parameters, outperforming the latest competitors by 2. 0%.

Paper
Code

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

opengvlab/internvl • • 21 Dec 2023

However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.

Paper
Code

Video-Text Retrieval

Benchmarks Add a Result

Libraries

Datasets

Most implemented papers

Content

Benchmarks

Add a Result