Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries

3 Apr 2020  ·  Hao Wang, Cheng Deng, Fan Ma, Yi Yang ·

Actor and action video segmentation with language queries aims to segment the objects in a video referred to by a natural-language expression. This task requires comprehensive language reasoning and fine-grained video understanding. Previous methods mainly leverage dynamic convolutional networks to match visual and semantic representations. However, dynamic convolution neglects spatial context when processing each region of a frame, making it difficult to separate similar objects in complex scenes. To address this limitation, we construct a context modulated dynamic convolutional network. Specifically, we propose a context modulated dynamic convolutional operation in which the kernels for a specific region are generated from both the language sentence and the surrounding context features. Moreover, we devise a temporal encoder that incorporates motion into the visual features to better match the query descriptions. Extensive experiments on two benchmark datasets, Actor-Action Dataset Sentences (A2D Sentences) and J-HMDB Sentences, demonstrate that our proposed approach notably outperforms state-of-the-art methods.
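The core idea, per the abstract, is that each region's dynamic kernel is generated from the language sentence *and* that region's surrounding context, rather than from the sentence alone. A minimal NumPy sketch of that idea is below; the pooling window, the projection matrices `W_lang`/`W_ctx`, and the additive modulation are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def context_modulated_kernels(lang_emb, feat_map, ctx_size=3):
    """Generate a per-location 1x1 dynamic kernel from a language embedding,
    modulated by the average-pooled neighborhood around each location.
    Hypothetical sketch: the real model's kernel generator differs."""
    C, H, W = feat_map.shape
    D = lang_emb.shape[0]

    # Local context: mean over a ctx_size x ctx_size window at each pixel.
    pad = ctx_size // 2
    padded = np.pad(feat_map, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    ctx = np.zeros_like(feat_map)
    for i in range(H):
        for j in range(W):
            ctx[:, i, j] = padded[:, i:i + ctx_size, j:j + ctx_size].mean(axis=(1, 2))

    # Illustrative (randomly initialized) projections into kernel space;
    # in a trained model these would be learned parameters.
    rng = np.random.default_rng(0)
    W_lang = rng.standard_normal((C, D)) * 0.1   # language -> kernel
    W_ctx = rng.standard_normal((C, C)) * 0.1    # context  -> kernel offset

    # One shared language-derived kernel, shifted per location by its context.
    k_lang = W_lang @ lang_emb                                   # (C,)
    return k_lang[:, None, None] + np.einsum("cd,dhw->chw", W_ctx, ctx)  # (C,H,W)

def dynamic_conv_response(feat_map, kernels):
    """Per-location inner product between features and their dynamic kernel,
    yielding a segmentation response map."""
    return np.einsum("chw,chw->hw", feat_map, kernels)
```

Because the kernel now varies across locations, two visually similar regions with different surroundings receive different kernels, which is exactly the property plain (spatially shared) dynamic convolution lacks.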

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Segmentation | A2D Sentences | CMDy | Precision@0.5 | 0.607 | #17 |
| Referring Expression Segmentation | A2D Sentences | CMDy | Precision@0.6 | 0.525 | #19 |
| Referring Expression Segmentation | A2D Sentences | CMDy | Precision@0.7 | 0.405 | #19 |
| Referring Expression Segmentation | A2D Sentences | CMDy | Precision@0.8 | 0.235 | #19 |
| Referring Expression Segmentation | A2D Sentences | CMDy | Precision@0.9 | 0.045 | #20 |
| Referring Expression Segmentation | A2D Sentences | CMDy | AP | 0.333 | #16 |
| Referring Expression Segmentation | A2D Sentences | CMDy | IoU overall | 0.623 | #18 |
| Referring Expression Segmentation | A2D Sentences | CMDy | IoU mean | 0.531 | #16 |
| Referring Expression Segmentation | J-HMDB | CMDy | Precision@0.5 | 0.742 | #14 |
| Referring Expression Segmentation | J-HMDB | CMDy | Precision@0.6 | 0.587 | #15 |
| Referring Expression Segmentation | J-HMDB | CMDy | Precision@0.7 | 0.316 | #15 |
| Referring Expression Segmentation | J-HMDB | CMDy | Precision@0.8 | 0.047 | #16 |
| Referring Expression Segmentation | J-HMDB | CMDy | Precision@0.9 | 0.000 | #11 |
| Referring Expression Segmentation | J-HMDB | CMDy | AP | 0.301 | #10 |
| Referring Expression Segmentation | J-HMDB | CMDy | IoU overall | 0.554 | #16 |
| Referring Expression Segmentation | J-HMDB | CMDy | IoU mean | 0.576 | #13 |
