Our approach leverages datasets of images and their sentence descriptions to learn the inter-modal correspondences between language and visual data.
The current leading approaches for semantic segmentation exploit shape information by extracting CNN features from masked image regions.
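The masked-region input can be sketched as follows. This is a hedged illustration, not the papers' actual pipeline: `masked_region_input` and `toy_features` are hypothetical names, and per-channel mean-pooling stands in for a real CNN feature extractor; the key step is blanking out pixels outside the region mask before feature extraction so that region shape influences the features.

```python
import numpy as np

def masked_region_input(image, mask, mean):
    """Prepare a masked-region crop for a feature extractor.

    image: H x W x 3 float array; mask: H x W bool array (True = region);
    mean: per-channel dataset mean. Background pixels are zeroed after
    mean-centering, so the region's shape is encoded in the input.
    """
    centered = image - mean          # subtract the dataset mean
    centered[~mask] = 0.0            # blank out pixels outside the region
    return centered

def toy_features(region):
    """Stand-in for CNN features: per-channel average over the crop."""
    return region.mean(axis=(0, 1))

# Toy usage: a 2x2 image where only the diagonal belongs to the region.
img = np.ones((2, 2, 3))
mask = np.array([[True, False], [False, True]])
feats = toy_features(masked_region_input(img, mask, np.zeros(3)))
```

Because the background is zeroed rather than cropped away, two regions with identical bounding boxes but different shapes produce different features.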
Correlation clustering, also known as multicut partitioning, is widely used in image segmentation: it partitions an undirected graph with positive and negative edge weights so that the total weight of the cut edges is minimized.
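The multicut objective itself is simple to state in code. The sketch below (the function name `multicut_cost` and the toy graph are illustrative, not from the source) evaluates the objective for a candidate partition: positive weights encourage keeping endpoints together, negative weights encourage cutting, and the solver searches for the partition with the lowest total cut weight.

```python
def multicut_cost(edges, labels):
    """Correlation-clustering objective: total weight of cut edges.

    edges: list of (u, v, weight) with signed weights;
    labels: dict mapping each node to its cluster id.
    An edge is cut when its endpoints land in different clusters.
    """
    return sum(w for u, v, w in edges if labels[u] != labels[v])

# Toy graph: attractive edges (0,1) and (2,3), repulsive edge (1,2).
edges = [(0, 1, 2.0), (1, 2, -1.5), (2, 3, 1.0)]
labels = {0: 0, 1: 0, 2: 1, 3: 1}   # cut only the repulsive edge
cost = multicut_cost(edges, labels)  # -1.5, lower than cutting nothing (0.0)
```

Note that, unlike k-means-style formulations, the number of clusters is not fixed in advance; it emerges from the signs of the edge weights.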
Deformable part models (DPMs) and convolutional neural networks (CNNs) are two widely used tools for visual recognition.
Building on the observation that foreground areas are surrounded by regions of high spatiotemporal edge values, geodesic distance provides an initial estimate of foreground and background.
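A minimal sketch of this idea, assuming a grid graph where stepping onto a pixel costs its (spatiotemporal) edge-map value: the function name `geodesic_distance` and the toy edge map are illustrative. Seeding the shortest-path search at the image border, pixels enclosed by a ring of strong edges accumulate a large geodesic distance, which flags them as likely foreground.

```python
import heapq

def geodesic_distance(edge_map, seeds):
    """Geodesic distance from seed pixels over a 2D cost map (Dijkstra).

    edge_map: 2D list of non-negative per-pixel costs;
    seeds: iterable of (row, col) starting pixels with distance 0.
    """
    rows, cols = len(edge_map), len(edge_map[0])
    dist = [[float('inf')] * cols for _ in range(rows)]
    heap = []
    for r, c in seeds:
        dist[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + edge_map[nr][nc]  # pay the cost of the pixel entered
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist

# Toy edge map: a ring of strong edges (9) enclosing the center pixel.
edge_map = [[0, 0, 0, 0, 0],
            [0, 9, 9, 9, 0],
            [0, 9, 0, 9, 0],
            [0, 9, 9, 9, 0],
            [0, 0, 0, 0, 0]]
border = [(r, c) for r in range(5) for c in range(5)
          if r in (0, 4) or c in (0, 4)]
dist = geodesic_distance(edge_map, border)
# dist is high inside the ring and zero along the open border,
# so thresholding it yields an initial foreground mask.
```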
Image representations, from SIFT and Bag of Visual Words to Convolutional Neural Networks (CNNs), are a crucial component of almost any image understanding system.
We propose to train the parameters of the filters and the influence functions through a loss-based approach.
Visual features are of vital importance for human action understanding in videos.