UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

With the recent success of pre-training techniques for NLP and image-language tasks, video-language pre-training methods are gradually being developed to improve video-text downstream tasks. However, most existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks...


Results from the Paper


 Ranked #1 on Video Retrieval on YouCook2 (using extra training data)

TASK              DATASET   MODEL  METRIC                     VALUE  GLOBAL RANK
Video Retrieval   MSR-VTT   UniVL  text-to-video R@1          21.2   #3
Video Retrieval   MSR-VTT   UniVL  text-to-video R@5          49.6   #2
Video Retrieval   MSR-VTT   UniVL  text-to-video R@10         63.1   #1
Video Retrieval   MSR-VTT   UniVL  text-to-video Median Rank  6      #1
Video Captioning  YouCook2  UniVL  BLEU-3                     23.87  #1
Video Captioning  YouCook2  UniVL  BLEU-4                     17.35  #1
Video Captioning  YouCook2  UniVL  METEOR                     22.35  #1
Video Captioning  YouCook2  UniVL  ROUGE-L                    46.52  #1
Video Captioning  YouCook2  UniVL  CIDEr                      1.81   #1
Video Retrieval   YouCook2  UniVL  text-to-video R@1          28.9   #1
Video Retrieval   YouCook2  UniVL  text-to-video R@5          57.6   #1
Video Retrieval   YouCook2  UniVL  text-to-video R@10         70.0   #1
Video Retrieval   YouCook2  UniVL  text-to-video Median Rank  4      #1
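The retrieval numbers above use Recall@K (the fraction of text queries whose ground-truth video appears in the top K ranked results) and Median Rank (the median position of the ground-truth video). These metrics can be computed from a text-video similarity matrix; the sketch below assumes the standard convention that video i is the ground-truth match for text i, and is illustrative rather than the paper's exact evaluation script.

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute text-to-video R@K and Median Rank.

    sim[i, j] is the similarity between text i and video j; video i is
    assumed to be the ground-truth match for text i (illustrative
    convention, not necessarily UniVL's exact evaluation code).
    """
    # Sort videos for each text query, best score first.
    order = np.argsort(-sim, axis=1)
    # 1-based rank of the ground-truth video for each query.
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1
    return {
        "R@1": float(np.mean(ranks <= 1)),
        "R@5": float(np.mean(ranks <= 5)),
        "R@10": float(np.mean(ranks <= 10)),
        "MedR": float(np.median(ranks)),
    }
```

Higher R@K and lower Median Rank are better, which is why UniVL's YouCook2 Median Rank of 4 ranks #1.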

Methods used in the Paper


METHOD                           TYPE
Residual Connection              Skip Connections
Attention Dropout                Regularization
Linear Warmup With Linear Decay  Learning Rate Schedules
Weight Decay                     Regularization
BPE                              Subword Segmentation
GELU                             Activation Functions
Dense Connections                Feedforward Networks
Label Smoothing                  Regularization
ReLU                             Activation Functions
Adam                             Stochastic Optimization
WordPiece                        Subword Segmentation
Softmax                          Output Functions
Dropout                          Regularization
Multi-Head Attention             Attention Modules
Layer Normalization              Normalization
Scaled Dot-Product Attention     Attention Mechanisms
Transformer                      Transformers
BERT                             Language Models
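Several of the attention-related entries above (Scaled Dot-Product Attention, Multi-Head Attention, Softmax) build on one formula, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of single-head scaled dot-product attention; it is an illustration of the listed method, not UniVL's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for single-head attention.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Minimal sketch of the listed method, not the UniVL codebase.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) logits
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n_q, d_v) outputs
```

Each output row is a convex combination of value rows; with all-zero queries the softmax weights are uniform, so the output reduces to the mean of V.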