RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation

The task of video object segmentation with referring expressions (language-guided VOS) is, given a linguistic phrase and a video, to generate binary masks for the object to which the phrase refers. Our work argues that existing benchmarks used for this task are mainly composed of trivial cases, in which referents can be identified with simple phrases...

Results (task: Referring Expression Segmentation)

Dataset           Model                         Metric           Value   Global rank
A2Dre test        RefVOS                        Overall IoU      47.5    #1
A2Dre test        RefVOS                        Mean IoU         33.2    #1
A2D Sentences     RefVOS                        Precision@0.5    49.5    #3
A2D Sentences     RefVOS                        Precision@0.9    6.4     #1
A2D Sentences     RefVOS                        Overall IoU      59.9    #2
A2D Sentences     RefVOS                        Mean IoU         59.9    #1
DAVIS 2017 (val)  RefVOS                        J&F 1st frame    44.5    #1
DAVIS 2017 (val)  RefVOS                        J&F full video   45.1    #1
RefCOCO testA     RefVOS with Bi-LSTM           IoU              52.90   #8
RefCOCO testA     RefVOS with BERT Pre-train    IoU              63.19   #3
RefCOCO+ testA    RefVOS with BERT + MLM loss   Overall IoU      49.73   #5
RefCOCO testB     RefVOS with BERT Pre-train    IoU              54.17   #5
RefCOCO+ testB    RefVOS with BERT + MLM loss   Overall IoU      36.17   #6
RefCOCO val       RefVOS with BERT Pre-train    IoU              58.65   #5
RefCOCO val       RefVOS with BERT + MLM loss   IoU              59.45   #3
RefCOCO+ val      RefVOS with BERT + MLM loss   Overall IoU      44.71   #5
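The table mixes several mask-quality metrics. As a rough guide: overall IoU divides the total intersection by the total union accumulated over a whole split (so large objects weigh more), mean IoU averages the per-sample IoU, and Precision@K is the fraction of samples whose IoU reaches the threshold K. On DAVIS, J is region similarity (per-frame IoU averaged over a video), F is a boundary F-measure, and J&F is their mean. The NumPy sketch below illustrates these conventional definitions under stated assumptions: the function names are hypothetical, two empty masks are scored as IoU 1.0, a Precision@K hit is assumed to mean IoU >= K, and the contour measure F is not implemented. It is not the official evaluation code behind the numbers above.

```python
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def referring_seg_metrics(preds, gts, thresholds=(0.5, 0.9)):
    """Overall IoU, mean IoU and Precision@K over paired prediction/ground-truth masks."""
    inter_total = 0
    union_total = 0
    per_sample = []
    for pred, gt in zip(preds, gts):
        pred = pred.astype(bool)
        gt = gt.astype(bool)
        # Overall IoU accumulates pixel counts across the split before dividing.
        inter_total += int(np.logical_and(pred, gt).sum())
        union_total += int(np.logical_or(pred, gt).sum())
        # Mean IoU and Precision@K are computed from per-sample IoU values.
        per_sample.append(iou(pred, gt))
    per_sample = np.array(per_sample)
    results = {
        "overall_iou": inter_total / union_total if union_total else 1.0,
        "mean_iou": float(per_sample.mean()),
    }
    for k in thresholds:
        # Assumption: a sample counts as a hit when its IoU is at least k.
        results[f"precision@{k}"] = float((per_sample >= k).mean())
    return results


def region_similarity_j(pred_frames, gt_frames) -> float:
    """DAVIS-style region similarity J: per-frame IoU averaged over one video."""
    return float(np.mean([iou(p, g) for p, g in zip(pred_frames, gt_frames)]))
```

For example, two samples with IoU 0.5 and 1.0 give a mean IoU of 0.75, while their overall IoU depends on how many pixels sit behind each ratio, which is why the two columns can differ noticeably on the same split.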

Methods used in the Paper

Method                                Type
1x1 Convolution                       Convolutions
Convolution                           Convolutions
Residual Connection                   Skip Connections
Weight Decay                          Regularization
Attention Dropout                     Regularization
Linear Warmup With Linear Decay       Learning Rate Schedules
WordPiece                             Subword Segmentation
Adam                                  Stochastic Optimization
Dropout                               Regularization
Softmax                               Output Functions
Dense Connections                     Feedforward Networks
GELU                                  Activation Functions
Multi-Head Attention                  Attention Modules
Layer Normalization                   Normalization
Scaled Dot-Product Attention          Attention Mechanisms
BERT                                  Language Models
Dilated Convolution                   Convolutions
Grouped Convolution                   Convolutions
Multiscale Dilated Convolution Block  Image Model Blocks
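Many entries in the methods list are components of BERT, which several RefVOS variants in the results use as the language encoder. BERT's attention layers are built on scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. The following is a minimal, generic NumPy sketch of that formula, not code from RefVOS; the function names and the boolean `mask` convention (True marks key positions a query may attend to) are assumptions for illustration.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the max along the axis for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V for arrays shaped (..., seq_len, dim).

    Returns the attended values and the attention weights.
    """
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Hypothetical convention: blocked positions get a large negative score,
        # so their softmax weight is effectively zero.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights
```

Dividing by sqrt(d_k) keeps the dot products from growing with the key dimension, which would otherwise push the softmax toward near one-hot weights. Multi-head attention runs several such attentions in parallel on learned projections of Q, K and V and concatenates the results.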