MeViS is a large-scale dataset for motion-expression-guided video segmentation, which focuses on segmenting objects in a video based on a sentence describing their motion.
8 PAPERS • 1 BENCHMARK
Previous works [6, 10] constructed referring segmentation datasets for videos. Each video is recorded at 30 fps with pixel-level instance segmentation annotations every 5 frames, and lasts around 3 to 6 seconds.
35 PAPERS • 3 BENCHMARKS