Previous works [6, 10] have constructed referring segmentation datasets for videos. Gavrilyuk et al. extended the A2D and J-HMDB datasets with natural sentences. These datasets focus on describing the ‘actors’ and ‘actions’ appearing in videos, so the instance annotations are limited to a few object categories corresponding to the dominant ‘actors’ performing a salient ‘action’. Khoreva et al. built a dataset based on DAVIS, but its scale is barely sufficient to train an end-to-end model from scratch.
Youtube-VOS has 4,519 high-resolution videos with 94 common object categories. Each video is recorded at 30 fps and has pixel-level instance segmentation annotations on every fifth frame; the videos last around 3 to 6 seconds.
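To give a sense of the annotation density these figures imply, the following minimal sketch estimates the number of annotated frames per video. It assumes that annotation starts at the first frame and that the frame rate is exactly 30 fps; the function name `annotated_frames` is illustrative and not part of any dataset toolkit.

```python
FPS = 30                # frame rate stated in the text
ANNOTATION_STRIDE = 5   # one annotated frame every 5 frames


def annotated_frames(duration_seconds: float) -> int:
    """Approximate number of annotated frames in a clip of the given length.

    Assumes frames 0, 5, 10, ... are annotated (an assumption, not a
    documented property of the dataset).
    """
    total_frames = int(duration_seconds * FPS)
    # Ceiling division: a partial stride still contributes one annotated frame.
    return (total_frames + ANNOTATION_STRIDE - 1) // ANNOTATION_STRIDE


if __name__ == "__main__":
    for seconds in (3, 6):
        print(f"{seconds}s clip: about {annotated_frames(seconds)} annotated frames")
```

Under these assumptions, a 3 to 6 second clip carries roughly 18 to 36 annotated frames.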
We employed Amazon Mechanical Turk to annotate referring expressions. To ensure the quality of the annotations, we selected around 50 turkers after a validation test. Each turker was given a pair of videos, the original video and a mask-overlaid one with the target object highlighted, and was asked to provide a discriminative sentence of at most 20 words that describes the target object accurately. We collected two kinds of annotations, which describe the highlighted object (1) based on the whole video (Full-video expression) and (2) using only the first frame of the video (First-frame expression). After the initial annotation, we conducted verification and cleaning jobs for all annotations, and dropped any object that could not be localized using language expressions alone.
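As an illustration of the annotation protocol above, one collected expression could be represented and checked against the 20-word limit as in the sketch below. The record layout, field names, and the whitespace-based word count are assumptions made for this example; they do not describe the actual annotation tooling or file format.

```python
from dataclasses import dataclass

MAX_WORDS = 20  # word limit stated in the text


@dataclass
class ReferringExpression:
    video_id: str   # hypothetical video identifier
    object_id: int  # hypothetical object identifier within the video
    kind: str       # "full_video" or "first_frame"
    sentence: str   # the sentence written by the annotator


def within_word_limit(expr: ReferringExpression, limit: int = MAX_WORDS) -> bool:
    """Return True if the sentence is non-empty and has at most `limit` words.

    Words are counted by splitting on whitespace, which is an assumption.
    """
    return 0 < len(expr.sentence.split()) <= limit


if __name__ == "__main__":
    example = ReferringExpression(
        video_id="video_0001",
        object_id=1,
        kind="full_video",
        sentence="a person in a red jacket walking to the left",
    )
    print(within_word_limit(example))
```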
The following are the statistics and analysis of the two annotation types of the dataset after verification.
Full-video expression: Youtube-VOS has 6,459 and 1,063 unique objects in the train and validation splits, respectively. Among them, we cover 6,388 unique objects in 3,471 videos (6,388/6,459 = 98.9%) with 12,913 expressions in the train split and 1,063 unique objects in 507 videos (1,063/1,063 = 100%) with 2,096 expressions in the validation split. On average, each video has 3.8 language expressions and each expression has 10.0 words.
First-frame expression: There are 6,006 unique objects in 3,412 videos (6,006/6,459 = 93.0%) with 10,897 expressions in the train split and 1,030 unique objects in 507 videos (1,030/1,063 = 96.9%) with 1,993 expressions in the validation split. The number of annotated objects is lower than for the full-video expressions because using only the first frame makes annotation more ambiguous and inconsistent, so we dropped more annotations during verification. On average, each video has 3.2 language expressions and each expression has 7.5 words.
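The object coverage percentages quoted for both annotation types can be reproduced from the reported counts. The sketch below does so; the dictionary layout and the function name `coverage` are illustrative choices for this example, while the numbers are taken from the text.

```python
# Unique objects in Youtube-VOS per split, as reported in the text.
TOTAL_OBJECTS = {"train": 6459, "val": 1063}

# Annotated unique objects per annotation type and split, as reported in the text.
ANNOTATED_OBJECTS = {
    "full_video": {"train": 6388, "val": 1063},
    "first_frame": {"train": 6006, "val": 1030},
}


def coverage(kind: str, split: str) -> float:
    """Percentage of a split's unique objects that have at least one expression."""
    return 100.0 * ANNOTATED_OBJECTS[kind][split] / TOTAL_OBJECTS[split]


if __name__ == "__main__":
    for kind in ANNOTATED_OBJECTS:
        for split in TOTAL_OBJECTS:
            print(f"{kind:12s} {split:5s} {coverage(kind, split):.1f}%")
```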