Previous works [6, 10] have constructed referring segmentation datasets for videos. Gavrilyuk et al. [6] extended the A2D [33] and J-HMDB [9] datasets with natural sentences; these datasets focus on describing the ‘actors’ and ‘actions’ appearing in videos, so the instance annotations are limited to a few object categories corresponding to the dominant ‘actors’ performing a salient ‘action’. Khoreva et al. [10] built a dataset based on DAVIS [25], but its scale is barely sufficient to learn an end-to-end model from scratch.
34 PAPERS • 3 BENCHMARKS
MeViS is a large-scale dataset for motion-expression-guided video segmentation, which focuses on segmenting objects in video content based on a sentence describing the motion of the objects. The dataset contains numerous motion expressions that indicate target objects in complex environments.
8 PAPERS • 1 BENCHMARK