Refer-YouTube-VOS

Introduced by Seo et al. in URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark

There exist previous works [6, 10] that constructed referring segmentation datasets for videos. Gavrilyuk et al. [6] extended the A2D [33] and J-HMDB [9] datasets with natural sentences; the datasets focus on describing the ‘actors’ and ‘actions’ appearing in videos, therefore the instance annotations are limited to only a few object categories corresponding to the dominant ‘actors’ performing a salient ‘action’. Khoreva et al. [10] built a dataset based on DAVIS [25], but the scales are barely sufficient to learn an end-to-end model from scratch

Youtube-VOS has 4,519 high-resolution videos with 94 common object categories. Each video has pixel-level instance segmentation annotation at every 5 frames in 30-fps videos, and their durations are around 3 to 6 seconds.

We employed Amazon Mechanical Turk to annotate referring expressions. To ensure the quality of the annotations, we selected around 50 turkers after a validation test. Each turker was given a pair of videos, the original video and the mask-overlaid one with the target object highlighted, and was asked to provide a discriminative sentence within 20 words that describes the target object accurately. We collected two kinds of annotations, which describe the highlighted object (1) based on a whole video (Full-video expression) and (2) using only the first frame of the video (First-frame expression). After the initial annotation, we conducted verification and cleaning jobs for all annotations, and dropped objects if an object cannot be localized using language expressions only.

The followings are the statistics and analysis of the two annotation types of the dataset after the verification.

Full-video expression: Youtube-VOS has 6,459 and 1,063 unique objects in train and validation split, respectively. Among them, we cover 6,388 unique objects in 3,471 videos (6, 388/6, 459 = 98.9%) with 12,913 expressions in train split and 1,063 unique objects in 507 videos (1, 063/1, 063 = 100%) with 2,096 expressions in validation split. On average, each video has 3.8 language expressions and each expression has 10.0 words.

First-frame expression: There are 6,006 unique objects in 3,412 videos (6, 006 /6, 459 = 93.0%) with 10,897 expressions in train split and 1,030 unique objects in 507 videos (1, 030/1, 063 = 96.9%) with 1,993 expressions in validation split. The number of annotated objects is lower than that of the full-video expressions because using only the first frame makes annotation more ambiguous and inconsistent and we dropped more annotations during the verification. On average, each video has 3.2 language expressions and each expression has 7.5 words.

Homepage

Benchmarks

Add a new result Link an existing benchmark

Task	Dataset Variant	Best Model
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	GLEE-Pro
Referring Video Object Segmentation	Refer-YouTube-VOS	GLEE-Pro
Referring Expression Segmentation	Refer-YouTube-VOS	RefVOS-Human REs

Papers

Paper	Code	Results	Date	Stars

Dataset Loaders

Add Remove

No data loaders found. You can submit your data loader here.

Tasks

Similar Datasets

MeViS

Google Refexp

Referring Expressions for DAVIS 2016 & 2017

A2D

Usage

Refer-YouTube-VOS

Benchmarks Edit Add a new result Link an existing benchmark

Papers

Dataset Loaders Edit Add Remove

Tasks Edit

Similar Datasets

MeViS

Google Refexp

Referring Expressions for DAVIS 2016 & 2017

A2D

Usage

License Edit

Modalities Edit

Languages Edit

Benchmarks

Add a new result Link an existing benchmark

Dataset Loaders

Add Remove

Tasks

License

Modalities

Languages