TGIF-Frame

11 papers with code • 1 benchmark • 1 dataset

TGIF-Frame (FrameQA) is a video question answering task from the TGIF-QA benchmark: given an animated GIF, answer a natural-language question whose answer can be determined from a single frame, such as questions about objects, colors, numbers, or locations.

Datasets

TGIF-QA

Most implemented papers

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

alibaba/AliceMind 1 Feb 2023

In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.
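
As a rough illustration of the modular idea described above (shared universal layers for modality collaboration, separate modality-specific modules for disentanglement), here is a minimal PyTorch sketch; the class name, layer choices, and dimensions are illustrative assumptions, not the actual mPLUG-2 implementation.

```python
# Minimal sketch of a modular composition network: per-modality modules plus
# a shared "universal" module. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ModularComposer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Modality-specific modules: keep what is unique to each modality disentangled.
        self.text_module = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_module = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Universal module: shared across modalities for collaboration.
        self.universal_module = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, text_feats, video_feats):
        # Disentangled, modality-specific processing ...
        t = self.text_module(text_feats)
        v = self.video_module(video_feats)
        # ... followed by the same shared module applied to both streams.
        return self.universal_module(t), self.universal_module(v)

# Toy usage: batch of 2, 16 text tokens / 32 video patches, 512-dim features.
text = torch.randn(2, 16, 512)
video = torch.randn(2, 32, 512)
t_out, v_out = ModularComposer()(text, video)
```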

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

antoyang/FrozenBiLM 16 Jun 2022

Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability.

All in One: Exploring Unified Video-Language Pre-training

showlab/all-in-one CVPR 2023

In this work, we introduce, for the first time, an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.
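
The unified-backbone idea (one shared Transformer over raw video patches and text tokens, rather than separate per-modality encoders) can be sketched as follows; the patch size, vocabulary size, and the UnifiedBackbone name are illustrative assumptions, not the all-in-one Transformer's actual implementation.

```python
# Minimal sketch of a unified backbone that embeds video patches and text
# tokens jointly in one shared Transformer. Names and sizes are assumptions.
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    def __init__(self, vocab_size=30522, dim=384, heads=6, layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        # Flattened RGB patches (e.g. 16x16x3) projected to the shared width.
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)

    def forward(self, token_ids, video_patches):
        # Both modalities are mapped into one token sequence and processed
        # by the same backbone (no separate video and text encoders).
        tokens = torch.cat([self.text_embed(token_ids),
                            self.patch_embed(video_patches)], dim=1)
        return self.backbone(tokens)

# Toy usage: 8 text tokens plus 3 frames x 4 patches of 16x16x3 pixels.
ids = torch.randint(0, 30522, (2, 8))
patches = torch.randn(2, 3 * 4, 16 * 16 * 3)
joint = UnifiedBackbone()(ids, patches)   # (2, 8 + 12, 384)
```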

Clover: Towards A Unified Video-Language Alignment and Fusion Model

leeyn-43/clover CVPR 2023

We then introduce Clover, a Correlated Video-Language pre-training method, towards a universal Video-Language model that solves multiple video understanding tasks without compromising performance or efficiency.

MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models

mlvlab/MELTR CVPR 2023

Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning the target task via auxiliary learning.
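
To make the plug-in idea concrete, here is a minimal sketch of a small learned module that maps several auxiliary loss values to one combined, non-linear training objective; the MLP architecture and the LossCombiner name are assumptions for illustration, and the sketch omits MELTR's meta-learning of the combiner itself.

```python
# Minimal sketch of a plug-in module that non-linearly combines several
# auxiliary loss values into a single objective. Architecture is an
# illustrative assumption, not the exact MELTR design.
import torch
import torch.nn as nn

class LossCombiner(nn.Module):
    def __init__(self, num_losses, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_losses, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, losses):
        # losses: tensor of per-task loss values, shape (num_losses,).
        # The combiner learns a non-linear weighting of the auxiliary losses;
        # in the real method its parameters are themselves meta-learned.
        return self.net(losses).squeeze(-1)

combiner = LossCombiner(num_losses=3)
aux_losses = torch.tensor([0.7, 1.2, 0.3])
total_loss = combiner(aux_losses)   # scalar used for the backward pass
```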

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

TXH-mercury/VALOR 17 Apr 2023

Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

txh-mercury/vast NeurIPS 2023

Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).
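
As a rough illustration of the omni-modality setup, the sketch below fuses vision, audio, and subtitle features into one video-side embedding and scores it against a text query for retrieval; the stubbed linear encoders, fusion by averaging, and all dimensions are illustrative assumptions, not VAST's architecture.

```python
# Minimal sketch of an omni-modality video-text matching score: vision,
# audio, and subtitle features fused on the video side, compared with text.
# Encoders are stubbed with linear projections; all sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniVideoEncoder(nn.Module):
    def __init__(self, vis_dim=768, aud_dim=512, sub_dim=768, txt_dim=768, dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)
        self.aud_proj = nn.Linear(aud_dim, dim)
        self.sub_proj = nn.Linear(sub_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)

    def forward(self, vis, aud, sub, txt):
        # Fuse the three video-side modalities by averaging their projections.
        video = (self.vis_proj(vis) + self.aud_proj(aud) + self.sub_proj(sub)) / 3
        text = self.txt_proj(txt)
        # Cosine similarity gives a video-text matching score for retrieval.
        return F.cosine_similarity(video, text, dim=-1)

model = OmniVideoEncoder()
score = model(torch.randn(4, 768), torch.randn(4, 512),
              torch.randn(4, 768), torch.randn(4, 768))   # (4,)
```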

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

txh-mercury/cosa 15 Jun 2023

Due to the limited scale and quality of video-text training corpora, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations.

Lightweight Recurrent Cross-modal Encoder for Video Question Answering

Sejong-VLI/VQA-LRCE-KBS-2023 Knowledge-Based Systems 2023

Due to the high computational cost of self-attention and the high dimensionality of video data, they have to settle for either: 1) training the cross-modal encoder only on offline-extracted video and text features, or 2) training the cross-modal encoder together with the video and text feature extractors, but using only sparsely sampled video frames.
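
The second option, end-to-end training on only a few frames per clip, can be illustrated with a generic uniform frame sampler; the function below is an assumption for illustration and not the sampling scheme used in the paper.

```python
# Minimal sketch of sparse frame sampling: feed the video encoder only a
# handful of uniformly spaced frames instead of the full decoded clip.
import torch

def sample_frames(video, num_frames=8):
    """Uniformly pick `num_frames` frames from a (T, C, H, W) clip."""
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, steps=num_frames).long()
    return video[idx]

clip = torch.randn(120, 3, 224, 224)   # 120 decoded frames
sparse = sample_frames(clip)           # (8, 3, 224, 224) fed to the encoder
```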