This paper focuses on a referring expression generation (REG) task in which the aim is to pick out an object in a complex visual scene.
To mitigate this issue, we propose to generate visual samples based on semantic embeddings using a conditional variational autoencoder (CVAE) model.
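The conditional sampling step can be sketched in a few lines. This is a minimal illustration (all weights, dimensions, and names here are illustrative, not the paper's architecture): at generation time, a latent vector is drawn from the prior and decoded together with a class's semantic embedding, yielding visual feature samples conditioned on that class.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, embed_dim, feat_dim = 16, 32, 64
W = rng.normal(scale=0.1, size=(latent_dim + embed_dim, feat_dim))  # toy decoder weights

def sample_visual_features(semantic_embedding, n_samples):
    z = rng.normal(size=(n_samples, latent_dim))        # z ~ N(0, I), the CVAE prior
    cond = np.tile(semantic_embedding, (n_samples, 1))  # repeat the class condition
    h = np.concatenate([z, cond], axis=1)               # decoder input: [z; embedding]
    return np.tanh(h @ W)                               # decoded "visual" samples

emb = rng.normal(size=embed_dim)                        # a class's semantic embedding
samples = sample_visual_features(emb, n_samples=5)      # 5 synthetic samples for that class
```

Because each sample draws a fresh latent `z` under the same condition, the outputs vary within the class while remaining tied to its semantics.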
Although deep neural networks have achieved strong performance on classification tasks, recent studies have shown that well-trained networks can be fooled by adding subtle noise.
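The "subtle noise" phenomenon can be illustrated with a gradient-sign perturbation in the style of FGSM. The sketch below uses a toy logistic-regression "network" so the input gradient is analytic (the model and parameters are hypothetical, chosen only to make the effect visible):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One gradient-sign step for logistic regression, loss = -log p(y|x)."""
    p = sigmoid(w @ x + b)            # predicted probability of class 1
    grad_x = (p - y) * w              # analytic gradient of the loss w.r.t. x
    return x + eps * np.sign(grad_x)  # perturb in the loss-ascent direction

rng = np.random.default_rng(0)
w = rng.normal(size=8)
x = w.copy()                          # a point the model classifies confidently as 1
x_adv = fgsm(x, y=1.0, w=w, b=0.0, eps=2.0)
print(sigmoid(w @ x), sigmoid(w @ x_adv))  # confidence drops after the perturbation
```

The same idea scales to deep networks, where the input gradient comes from backpropagation instead of a closed form.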
Current video shadow detection methods achieve this goal via co-attention, which mostly exploits temporally coherent information but is not robust at detecting moving shadows and small shadow regions.
Inspired by physical models of shadow formation, we use a linear illumination transformation to model the shadow effects in the image, which allows a shadow image to be expressed as a combination of the shadow-free image, the shadow parameters, and a matte layer.
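The decomposition described above can be sketched as follows. This is a minimal illustration under assumed notation (parameter names `w`, `b`, and `matte` are illustrative, not the paper's exact symbols): shadowed pixels are a per-channel affine transform of the shadow-free image, and a matte blends the shadowed and shadow-free regions.

```python
import numpy as np

def compose_shadow(shadow_free, w, b, matte):
    """shadow_free: HxWx3 image in [0, 1]; w, b: per-channel shadow
    parameters; matte: HxW in [0, 1], where 1 means fully in shadow."""
    darkened = w * shadow_free + b                 # linear illumination change
    alpha = matte[..., None]                       # broadcast matte over channels
    return alpha * darkened + (1.0 - alpha) * shadow_free

img = np.full((4, 4, 3), 0.8)                      # flat gray shadow-free image
matte = np.zeros((4, 4))
matte[:, :2] = 1.0                                 # left half is shadowed
shadowed = compose_shadow(img, w=np.array([0.5, 0.45, 0.4]), b=0.02, matte=matte)
```

Inverting this model (estimating `w`, `b`, and the matte from a shadow image) is what recovers the shadow-free image.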
Ranked #2 on Shadow Removal on Adjusted ISTD
We assume that the distribution of intra-class variance generalizes from the base classes to the novel classes.
Our method achieves competitive shadow removal results compared to state-of-the-art methods that are trained with fully paired shadow and shadow-free images.
Ranked #4 on Shadow Removal on Adjusted ISTD
This segmentation network is trained with a specific loss function, based on the average activation, to learn effectively from weakly annotated data.
The A-Net modifies the original training images, constrained by a simplified physical shadow model, and focuses on fooling the D-Net's shadow predictions.
Ranked #4 on Shadow Detection on SBU
This paper proposes a geodesic-distance-based feature that encodes global information for improved video segmentation algorithms.
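A geodesic-distance feature of this kind can be sketched with Dijkstra's algorithm on the pixel grid, where edge costs come from intensity differences (the 2-D toy "frame" and the cost function here are illustrative assumptions, not the paper's exact formulation): spatially close pixels separated by a strong boundary end up geodesically far apart.

```python
import heapq
import numpy as np

def geodesic_distance(img, seed):
    """Shortest-path distance from `seed` over a 4-connected pixel grid,
    with step cost equal to the absolute intensity difference."""
    h, w = img.shape
    dist = np.full((h, w), np.inf)
    dist[seed] = 0.0
    heap = [(0.0, seed)]
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if d > dist[y, x]:
            continue                               # stale heap entry
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                nd = d + abs(img[ny, nx] - img[y, x])  # intensity-based cost
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    heapq.heappush(heap, (nd, (ny, nx)))
    return dist

frame = np.zeros((5, 5))
frame[:, 3:] = 1.0                                 # two flat regions, sharp edge
d = geodesic_distance(frame, seed=(0, 0))          # distances stay 0 within the region
```

Pixels in the seed's flat region have distance 0, while crossing the edge costs the full intensity jump, which is what makes the feature encode global region structure.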
Ranked #1 on Video Segmentation on SegTrack v2
Co-localization is the problem of localizing objects of the same class using only the set of images that contain them.
Ranked #1 on Object Localization on PASCAL VOC 2007
Video segmentation is the task of grouping similar pixels in the spatio-temporal domain, and has become an important preprocessing step for subsequent video analysis.