DAVIS17 is a dataset for video object segmentation. It contains a total of 150 videos: 60 for training, 30 for validation, and 60 for testing (30 test-dev and 30 test-challenge).
270 PAPERS • 11 BENCHMARKS
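As a concrete illustration of the split structure above, the sketch below enumerates the sequences of each DAVIS17 split. It assumes the standard DAVIS directory layout (`ImageSets/2017/<split>.txt` with one sequence name per line, frames under `JPEGImages/480p/<sequence>/`); adjust the paths if your local copy is organized differently.

```python
from pathlib import Path

# Assumed root of a local DAVIS 2017 download; adjust as needed.
DAVIS_ROOT = Path("DAVIS")

# The four official splits (60 train, 30 val, 30 test-dev, 30 test-challenge).
SPLITS = ["train", "val", "test-dev", "test-challenge"]

def load_split(split: str) -> list[str]:
    """Read the sequence names for one split from ImageSets/2017/<split>.txt."""
    split_file = DAVIS_ROOT / "ImageSets" / "2017" / f"{split}.txt"
    return [line.strip() for line in split_file.read_text().splitlines() if line.strip()]

def frames_for(sequence: str) -> list[Path]:
    """List the JPEG frames of one sequence, sorted by frame index."""
    return sorted((DAVIS_ROOT / "JPEGImages" / "480p" / sequence).glob("*.jpg"))

if __name__ == "__main__":
    for split in SPLITS:
        print(f"{split}: {len(load_split(split))} sequences")
```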
…To validate our approach we employ two popular video object segmentation datasets, DAVIS16 [38] and DAVIS17 [42]. For the multiple-object video segmentation task we consider DAVIS17. As our goal is to segment objects in videos using language specifications, we augment all objects annotated with mask labels in DAVIS16 and DAVIS17 with non-ambiguous referring expressions. (We quantified that only ∼15% of the collected descriptions become invalid over time, and this does not strongly affect segmentation results, as the temporal consistency step helps to disambiguate such cases.) We believe the collected data will be of interest to the segmentation as well as the vision-and-language communities, providing an opportunity to explore language as an alternative input for video object segmentation.
75 PAPERS • 5 BENCHMARKS
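To make the augmentation concrete, the sketch below pairs each annotated DAVIS object with its referring expressions. The file name and the JSON layout (`{video: {object_id: [expressions]}}`) are hypothetical stand-ins for however the expressions are distributed, not a released format.

```python
import json
from pathlib import Path

# Hypothetical file mapping each video and object id to its referring
# expressions, e.g. {"bmx-trees": {"1": ["the rider on the bike"]}, ...}.
EXPR_FILE = Path("davis17_referring_expressions.json")

def load_expressions(path: Path) -> dict[tuple[str, int], list[str]]:
    """Flatten the nested JSON into a (video, object_id) -> expressions map."""
    raw = json.loads(path.read_text())
    table = {}
    for video, objects in raw.items():
        for object_id, expressions in objects.items():
            table[(video, int(object_id))] = expressions
    return table

if __name__ == "__main__":
    table = load_expressions(EXPR_FILE)
    # Each object id corresponds to one mask label in the DAVIS annotation
    # PNGs, so a referring model is queried with (video, expression) pairs.
    for (video, object_id), expressions in list(table.items())[:3]:
        print(video, object_id, expressions[0])
```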
…In this game, the first player views an image with a segmented target object and writes a natural language expression referring to that object. These datasets serve as valuable resources for tasks like referring expression segmentation, comprehension, and visual grounding in computer vision research.
301 PAPERS • 19 BENCHMARKS
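A single round of the two-player game yields one grounded record: an image, the target region, and the expression the first player wrote. The dataclass below is a hypothetical schema for such a record (field names and the example values are illustrative, not any released annotation format).

```python
from dataclasses import dataclass

@dataclass
class ReferringExpressionSample:
    """One round of the two-player annotation game (hypothetical schema)."""
    image_id: str          # image shown to both players
    region_id: int         # segmented target object within the image
    bbox: tuple[float, float, float, float]  # (x, y, w, h) of the target
    expression: str        # what the first player wrote

# The second player's job, and the model's at test time, is the inverse
# mapping: given (image_id, expression), recover region_id (comprehension)
# or the target pixels (segmentation).
sample = ReferringExpressionSample(
    image_id="COCO_train2014_000000000009",
    region_id=1,
    bbox=(10.0, 20.0, 150.0, 200.0),
    expression="the bowl on the left",
)
print(sample.expression)
```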
The Actor-Action Dataset (A2D) by Xu et al. [29] serves as the largest video dataset for the general actor and action segmentation task. As we are interested in pixel-level actor and action segmentation from sentences, we augment the videos in A2D with natural language descriptions about what each actor is doing in the videos.
29 PAPERS • 1 BENCHMARK
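The augmented annotations amount to one sentence per actor instance per video. A plausible on-disk form is a CSV of (video_id, instance_id, sentence) rows; the reader below is a sketch against that assumed format, not the dataset's actual release.

```python
import csv
from pathlib import Path

# Assumed CSV with one sentence per actor instance per video, e.g.
# video_id,instance_id,sentence
# _0aFp7Hx2XA,1,"a dog rolling on the ground"
ANNOTATION_FILE = Path("a2d_sentences.csv")

def load_sentences(path: Path) -> dict[str, list[tuple[int, str]]]:
    """Group (instance_id, sentence) pairs by video id."""
    by_video: dict[str, list[tuple[int, str]]] = {}
    with path.open(newline="") as f:
        for row in csv.DictReader(f):
            by_video.setdefault(row["video_id"], []).append(
                (int(row["instance_id"]), row["sentence"])
            )
    return by_video

if __name__ == "__main__":
    for video_id, pairs in load_sentences(ANNOTATION_FILE).items():
        # Each sentence describes what one actor is doing and supervises
        # the pixel-level mask of that actor throughout the video.
        print(video_id, pairs)
```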
Previous works [6, 10] have constructed referring segmentation datasets for videos. Each video carries pixel-level instance segmentation annotations on every fifth frame of 30-fps footage, and video durations are around 3 to 6 seconds.
34 PAPERS • 3 BENCHMARKS
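Those numbers pin down the annotation density: at 30 fps with a mask every fifth frame, a 3-to-6-second clip carries roughly 18 to 36 annotated frames. The helper below just makes that arithmetic explicit.

```python
def annotated_frame_indices(duration_s: float, fps: int = 30, step: int = 5) -> list[int]:
    """Frame indices that carry masks when every `step`-th frame is annotated."""
    total_frames = int(duration_s * fps)
    return list(range(0, total_frames, step))

# A 3-second clip: 90 frames, masks on frames 0, 5, ..., 85.
print(len(annotated_frame_indices(3)))   # 18
# A 6-second clip: 180 frames.
print(len(annotated_frame_indices(6)))   # 36
```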