This work studies feature representations for dense label propagation in video, with a focus on recently proposed methods that learn video correspondence using self-supervised signals such as colorization or temporal cycle consistency.
Next, to tackle harder tracking cases, we mine hard examples from an unlabeled pool of real videos using a tracker trained on our hallucinated video data.
We first introduce the vanilla video transformer and show that the transformer module can perform spatio-temporal modeling from raw pixels, albeit with heavy memory usage.
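To make the memory cost concrete, here is a minimal sketch (not the paper's implementation) of single-head self-attention over all spatio-temporal tokens of a clip: the attention matrix has shape (T·H·W) × (T·H·W), which is why joint modeling from raw pixels is memory-heavy. The fixed identity projections are illustrative; a real transformer learns query/key/value projections.

```python
import numpy as np

def joint_spacetime_attention(x):
    """Single-head self-attention over all spatio-temporal tokens.

    x: (T, H, W, C) video features, flattened to (T*H*W, C) tokens.
    The attention matrix is (T*H*W) x (T*H*W), so memory grows
    quadratically with clip length and spatial resolution.
    """
    T, H, W, C = x.shape
    tokens = x.reshape(T * H * W, C)
    # Illustrative fixed projections (a real transformer learns these).
    q, k, v = tokens, tokens, tokens
    scores = q @ k.T / np.sqrt(C)  # (THW, THW) -- the memory bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v).reshape(T, H, W, C)

# Even a tiny 8-frame 16x16 clip yields a 2048 x 2048 attention matrix.
out = joint_spacetime_attention(np.random.randn(8, 16, 16, 32))
print(out.shape)  # (8, 16, 16, 32)
```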
In action recognition research, a primary focus has been on how to construct and train networks that model the spatio-temporal volume of an input video.
Multi-object tracking systems often consist of a combination of a detector, a short-term linker, a re-identification feature extractor, and a solver that takes the output of these separate components and makes a final prediction.
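The four-component pipeline can be sketched as a simple composition. Everything below is a hypothetical skeleton, not any specific system: the four callables are toy stand-ins that show how the stages feed into each other.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    boxes: list  # per-frame (x, y, w, h)

def run_mot_pipeline(frames, detector, linker, reid, solver):
    """Illustrative composition of the four MOT components (all callables
    here are hypothetical placeholders, not a real tracker)."""
    detections = [detector(f) for f in frames]  # per-frame boxes
    tracklets = linker(detections)              # short-term association
    features = [reid(t) for t in tracklets]     # appearance embeddings
    return solver(tracklets, features)          # final long-term tracks

# Toy stubs: one detection per frame, one tracklet, trivial solver.
frames = [0, 1, 2]
detector = lambda f: [(10 + f, 10, 5, 5)]
linker = lambda dets: [[d[0] for d in dets]]
reid = lambda t: sum(b[0] for b in t) / len(t)
solver = lambda ts, fs: [Track(i, t) for i, t in enumerate(ts)]
tracks = run_mot_pipeline(frames, detector, linker, reid, solver)
print(len(tracks))  # 1
```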
Our results show that (i) mistakes on background are substantial and account for 18-49% of the total error, (ii) models do not generalize well to different kinds of backgrounds and perform poorly on images containing only background, and (iii) models make many more mistakes than those captured by the standard Mean Absolute Error (MAE) metric, as counting on background compensates considerably for misses on foreground.
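A small numeric sketch (with hypothetical counts) shows how background over-counting can offset foreground misses, so image-level MAE understates the true number of mistakes:

```python
# Hypothetical per-image counts split into foreground/background regions.
gt_fg, gt_bg = 100.0, 0.0      # ground truth: 100 people, none on background
pred_fg, pred_bg = 80.0, 15.0  # misses 20 on foreground, hallucinates 15 on background

# Standard image-level MAE only sees the summed counts:
mae = abs((pred_fg + pred_bg) - (gt_fg + gt_bg))
print(mae)  # 5.0 -- looks like a small error

# Region-wise accounting reveals the true magnitude of the mistakes:
true_error = abs(pred_fg - gt_fg) + abs(pred_bg - gt_bg)
print(true_error)  # 35.0 -- background counts compensated for foreground misses
```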
In this way, the proposed network aggregates the context information of a pixel from its semantic-correlated region instead of a predefined fixed region.
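One minimal way to realize "context from a semantic-correlated region" is non-local, similarity-weighted aggregation: each pixel gathers context from positions with similar features rather than from a fixed window. The softmax-over-dot-product form below is an assumption for illustration, not the paper's exact module.

```python
import numpy as np

def semantic_context_aggregation(feat):
    """Aggregate context for each pixel from its semantically correlated
    region: weights come from feature similarity, not spatial distance.

    feat: (H, W, C) feature map.
    """
    H, W, C = feat.shape
    pixels = feat.reshape(H * W, C)
    sim = pixels @ pixels.T / np.sqrt(C)  # pairwise semantic similarity
    w = np.exp(sim - sim.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)    # softmax over all positions
    context = w @ pixels                  # similarity-weighted aggregation
    return context.reshape(H, W, C)

ctx = semantic_context_aggregation(np.random.randn(8, 8, 16))
print(ctx.shape)  # (8, 8, 16)
```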
Furthermore, we introduce a “dense skip” architecture to retain a rich set of low-level information from the pre-trained CNN, which is essential to improve the low-level parsing performance.
Recently, segmentation neural networks have improved significantly, demonstrating very promising accuracies on public benchmarks.
In this paper, we first propose a novel context contrasted local feature that not only leverages the informative context but also spotlights the local information in contrast to the context.
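The contrast between local and context responses can be sketched as a local filter minus a wider context filter, spotlighting locally distinctive signal. The box filters and kernel sizes below are illustrative assumptions; the actual method learns both branches.

```python
import numpy as np

def context_contrasted_local(feat, k_local=3, k_context=9):
    """Sketch of a context-contrasted local feature: the local response
    minus a wider context response (kernel sizes are illustrative)."""
    def box_filter(x, k):
        H, W = x.shape
        pad = k // 2
        xp = np.pad(x, pad, mode='edge')
        out = np.zeros_like(x)
        for i in range(H):
            for j in range(W):
                out[i, j] = xp[i:i + k, j:j + k].mean()
        return out

    local = box_filter(feat, k_local)
    context = box_filter(feat, k_context)
    return local - context  # contrast of local against surrounding context

# A uniform region has no locally distinctive signal: the contrast is zero.
flat = context_contrasted_local(np.ones((8, 8)))
print(np.allclose(flat, 0.0))  # True
```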
This paper proposes a new method called Multimodal RNNs for RGB-D scene semantic segmentation.
Our method drives the network to learn a Level Set function for salient objects so it can output more accurate boundaries and compact saliency maps.
Scene labeling can be seen as a sequence-to-sequence prediction task (pixels to labels), and it is important to leverage relevant context to enhance the performance of pixel classification.
Matching pedestrians across multiple camera views, known as human re-identification (re-id), is a challenging problem in visual surveillance.
Then the Siamese CNN and temporally constrained metrics are jointly learned online to construct the appearance-based tracklet affinity models.
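As a rough sketch of an appearance-based tracklet affinity with a temporal constraint: compare mean embeddings (which would come from the shared Siamese CNN) by cosine similarity, and gate the score by the frame gap between tracklets. The embeddings, gap budget, and gating rule here are illustrative assumptions, not the paper's learned metric.

```python
import numpy as np

def tracklet_affinity(emb_a, emb_b):
    """Appearance affinity between two tracklets as cosine similarity of
    their mean embeddings (hypothetical Siamese-CNN outputs)."""
    a = np.mean(emb_a, axis=0)
    b = np.mean(emb_b, axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def gated_affinity(emb_a, emb_b, t_end_a, t_start_b, max_gap=30):
    """Temporal constraint: only link tracklets within a frame-gap budget."""
    if not (0 <= t_start_b - t_end_a <= max_gap):
        return 0.0
    return tracklet_affinity(emb_a, emb_b)

# Two hypothetical tracklets, three detections of 4-D embeddings each.
same = tracklet_affinity(np.ones((3, 4)), np.ones((3, 4)))
print(round(same, 2))  # 1.0
far = gated_affinity(np.ones((3, 4)), np.ones((3, 4)), t_end_a=10, t_start_b=100)
print(far)  # 0.0 -- temporal gap too large to link
```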
In the last few years, deep learning has led to very good performance on a variety of problems, such as visual recognition, speech recognition and natural language processing.
In this manuscript, we integrate CNNs with HRNNs, and develop end-to-end convolutional hierarchical recurrent neural networks (C-HRNNs).
In image labeling, local representations of image units are usually generated from their surrounding image patches, so long-range contextual information is not effectively encoded.
In order to encode the class correlation and class specific information in image representation, we propose a new local feature learning approach named Deep Discriminative and Shareable Feature Learning (DDSFL).
We adopt Convolutional Neural Networks (CNNs) as our parametric model to learn discriminative features and classifiers for local patch classification.