Referring Expressions for DAVIS 2016 & 2017

Introduced by Khoreva et al. in Video Object Segmentation with Language Referring Expressions

Our task is to localize an object and provide its pixel-level mask on all video frames, given a language referring expression obtained by looking either at the first frame only or at the full video. To validate our approach we employ two popular video object segmentation datasets, DAVIS16 [38] and DAVIS17 [42]. These two datasets introduce various challenges, containing videos with single or multiple salient objects, crowded scenes, similar-looking instances, occlusions, camera view changes, fast motion, etc.
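
To make the task setup concrete, the sketch below shows what a single evaluation sample and a per-video mIoU evaluation could look like. The `ReferringVOSSample` container and its field names are hypothetical illustrations, not part of the released data format; `predict_masks` stands in for any language-guided segmentation model.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class ReferringVOSSample:
    """One language-guided video object segmentation query (hypothetical layout)."""
    video_id: str                    # DAVIS sequence name, e.g. "blackswan"
    frames: List[np.ndarray]        # all RGB frames, each of shape (H, W, 3)
    expression: str                  # referring expression (first-frame or full-video type)
    target_masks: List[np.ndarray]  # ground-truth binary masks, one (H, W) array per frame


def mean_iou(predict_masks: Callable, sample: ReferringVOSSample) -> float:
    """Mean intersection-over-union between predicted and ground-truth masks over all frames."""
    ious = []
    for pred, gt in zip(predict_masks(sample.frames, sample.expression),
                        sample.target_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))
```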

DAVIS16 [38] consists of 30 training and 20 test videos of diverse object categories, with all frames annotated with pixel-level accuracy. Note that in this dataset only a single object is annotated per video. For the multiple-object video segmentation task we consider DAVIS17. Compared to DAVIS16, this is a more challenging dataset, with multiple objects annotated per video and more complex scenes featuring more distractors, occlusions, smaller objects, and fine structures. Overall, DAVIS17 consists of a training set with 60 videos, and validation, test-dev, and test-challenge sets with 30 sequences each.

As our goal is to segment objects in videos using language specifications, we augment all objects annotated with mask labels in DAVIS16 and DAVIS17 with non-ambiguous referring expressions. Following the work of [34], we ask an annotator to provide a language description of each mask-annotated object by looking only at the first frame of the video. Then another annotator is given the first frame and the corresponding description and asked to identify the referred object. If this annotator is unable to correctly identify the object, the description is corrected to remove ambiguity and to specify the object uniquely. We collected two referring expressions per target object, provided by annotators who are not computer vision experts (Annotator 1 and Annotator 2).
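
As an illustration, the snippet below sketches how the resulting per-object annotations could be stored; the JSON layout, field names, and example expressions are our own illustration, not the official release format.

```python
import json

# Hypothetical record: one entry per annotated object, holding the two
# independently collected first-frame expressions (Annotator 1 and Annotator 2).
record = {
    "video": "dogs-jump",            # DAVIS17 sequence name
    "object_id": 2,                  # mask label of the target object
    "first_frame_expressions": [
        {"annotator": 1, "text": "a brown dog jumping over the hurdle"},
        {"annotator": 2, "text": "the dog in the middle"},
    ],
    # Each description was verified by a second person who had to identify the
    # referred object in the first frame; ambiguous descriptions were revised.
    "verified": True,
}

print(json.dumps(record, indent=2))
```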

However, referring expressions obtained by looking only at the first frame may become invalid over the course of the video. (We quantified that only ∼15% of the collected descriptions become invalid over time, and this does not strongly affect segmentation results, as the temporal consistency step helps to disambiguate some of these cases; see the supplementary material for details.) Besides, in many applications, such as video editing or video-based advertisement, the user has access to the full video. Providing a language query which is valid for all frames might decrease the editing time and result in more coherent predictions. Thus, on DAVIS17 we asked the workers to provide a description of the object by looking at the full video. We collected one expression of the full-video type per target object. Future work may choose to use either setting.

The average length of the first-frame/full-video expressions is 5.5/6.3 words. For the DAVIS17 first-frame annotations we notice that descriptions given by Annotator 1 are longer than the ones by Annotator 2 (6.4 vs. 4.6 words). We evaluate the effect of description length on the grounding performance in §5. Moreover, the expressions relevant to the full video mention verbs more often than the first-frame descriptions (44% vs. 25%). This is intuitive, as referring to an object which changes its appearance and position over time may require mentioning its actions. Adjectives are present in over 50% of all annotations. Most of them refer to colors (over 70%), shapes and sizes (7%), and spatial/ordering words (6% for first-frame vs. 13% for full-video expressions). The full-video expressions also contain more adverbs and prepositions, and overall are more complex than the ones provided for the first frame.
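
Such per-expression statistics are straightforward to reproduce from a list of expressions; below is a minimal sketch using NLTK's off-the-shelf tokenizer and POS tagger (any tagger would do, and the reported numbers were of course computed by the authors on the full annotation set).

```python
import nltk

# Resource names may differ slightly across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)


def expression_stats(expressions):
    """Average length plus the fraction of expressions containing a verb / an adjective."""
    lengths, with_verb, with_adj = [], 0, 0
    for text in expressions:
        tokens = nltk.word_tokenize(text)
        tags = [tag for _, tag in nltk.pos_tag(tokens)]
        lengths.append(len(tokens))
        with_verb += any(t.startswith("VB") for t in tags)
        with_adj += any(t.startswith("JJ") for t in tags)
    n = len(expressions)
    return {
        "avg_length": sum(lengths) / n,
        "verb_fraction": with_verb / n,
        "adjective_fraction": with_adj / n,
    }


print(expression_stats(["a man riding a bike on the left",
                        "the black swan swimming in the water"]))
```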

Overall, the augmented DAVIS16/17 contains ∼1.2k referring expressions for more than 400 objects across 150 videos with ∼10k frames. We believe the collected data will be of interest to the segmentation as well as the vision-and-language communities, providing an opportunity to explore language as an alternative input for video object segmentation.
