Large-scale labeled training data is often difficult to collect, especially for person identities.
Video scene parsing is a long-standing challenging task in computer vision, aiming to assign pre-defined semantic labels to pixels of all frames in a given video.
To enrich language models with domain knowledge is crucial but difficult.
Semi-supervised video object segmentation is an interesting yet challenging task in machine learning.
In this paper, a novel Context-and-Spatial Aware Network (CSANet), which integrates both a Context Aware Path and Spatial Aware Path, is proposed to obtain effective features involving both context information and spatial information.