TGIF-Transition
5 papers with code • 1 benchmark • 1 dataset
Most implemented papers
All in One: Exploring Unified Video-Language Pre-training
In this work, we introduce for the first time an end-to-end video-language model, the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.
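A minimal sketch of the unified-backbone idea described above, assuming a PyTorch setup: video patch tokens and text tokens are concatenated and processed by one shared Transformer encoder rather than separate per-modality encoders. The class name, dimensions, and vocabulary size are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UnifiedVideoTextBackbone(nn.Module):
    """Single shared encoder over concatenated video and text tokens (illustrative)."""
    def __init__(self, dim=768, depth=6, heads=12, vocab_size=30522, patch_dim=3 * 16 * 16):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)      # flattened video patches -> tokens
        self.text_embed = nn.Embedding(vocab_size, dim)   # text token ids -> tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # one unified backbone

    def forward(self, video_patches, text_ids):
        # video_patches: (B, N_v, patch_dim), text_ids: (B, N_t)
        v = self.patch_embed(video_patches)
        t = self.text_embed(text_ids)
        tokens = torch.cat([v, t], dim=1)   # joint video-text sequence
        return self.encoder(tokens)          # joint representations for downstream heads
```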
Clover: Towards A Unified Video-Language Alignment and Fusion Model
We then introduce Clover, a Correlated Video-Language pre-training method, towards a universal Video-Language model that solves multiple video understanding tasks without compromising performance or efficiency.
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Masked visual modeling (MVM) has recently been proven effective for visual pre-training.
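For readers unfamiliar with MVM, a minimal sketch of one common variant, assuming regression to the original patch features: random video patch tokens are masked out and the model is trained to reconstruct them at the masked positions. The function name, mask ratio, and encoder/predictor interfaces are hypothetical, not the paper's specific objective.

```python
import torch
import torch.nn.functional as F

def mvm_loss(patch_tokens, encoder, predictor, mask_ratio=0.15):
    """Masked visual modeling loss (illustrative feature-regression variant)."""
    # patch_tokens: (B, N, D) video patch embeddings used as regression targets
    B, N, D = patch_tokens.shape
    mask = torch.rand(B, N, device=patch_tokens.device) < mask_ratio   # (B, N) boolean mask
    masked_input = patch_tokens.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out masked patches
    hidden = encoder(masked_input)      # contextualized token representations
    pred = predictor(hidden)            # (B, N, D) reconstructions
    # compute the loss only at masked positions
    return F.mse_loss(pred[mask], patch_tokens[mask])
```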
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
Therefore, we propose the MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning of the target task via auxiliary learning.
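A minimal sketch of the idea of non-linearly combining losses, under the assumption that a small Transformer consumes the individual loss values as a token sequence and outputs a single combined loss (a learned alternative to a fixed weighted sum). The class name and dimensions are invented for illustration; this is not the official MELTR code.

```python
import torch
import torch.nn as nn

class LossCombiner(nn.Module):
    """Learned, non-linear combination of several loss values (illustrative)."""
    def __init__(self, num_losses, dim=64, heads=4, depth=2):
        super().__init__()
        self.embed = nn.Linear(1, dim)                          # scalar loss -> token
        self.pos = nn.Parameter(torch.zeros(num_losses, dim))   # identifies which loss is which
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)                           # fused tokens -> combined loss

    def forward(self, losses):
        # losses: (B, num_losses) individual task/auxiliary loss values
        tokens = self.embed(losses.unsqueeze(-1)) + self.pos    # (B, num_losses, dim)
        fused = self.encoder(tokens)
        return self.head(fused.mean(dim=1)).squeeze(-1)         # (B,) combined training loss
```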
Lightweight Recurrent Cross-modal Encoder for Video Question Answering
Due to the high computational cost of self-attention and the high dimensionality of video data, existing methods have to settle for either 1) training the cross-modal encoder only on offline-extracted video and text features, or 2) training the cross-modal encoder together with the video and text feature extractors, but only on sparsely sampled video frames.