There exist previous works [6, 10] that constructed referring segmentation datasets for videos. Each video has pixel-level instance segmentation annotation at every 5 frames in 30-fps videos, and their durations are around 3 to 6 seconds.
34 PAPERS • 3 BENCHMARKS