Instead, we propose a novel method to encode all background objects in each image by using one fixed-size vector (i. e., FBE vector).
It is challenging for weakly supervised object detection network to precisely predict the positions of the objects, since there are no instance-level category annotations.
In order to optimize models and meet the requirement mentioned above, we propose a method that replaces the fully-connected layers of convolution neural network models with a tree classifier.
In image classification, Convolutional Neural Network(CNN) models have achieved high performance with the rapid development in deep learning.
Focusing on discriminate spatiotemporal feature learning, we propose Information Fused Temporal Transformation Network (IF-TTN) for action recognition on top of popular Temporal Segment Network (TSN) framework.
The proposed FSN can make dense predictions at frame-level for a video clip using both spatial and temporal context information.
Most of previous methods mainly consider to drop features from input data and hidden layers, such as Dropout, Cutout and DropBlocks.
On the contrary, the regularization term learned via discriminative approaches are usually trained for a specific image restoration problem, and fail in the problem for which it is not trained.
A newly proposed work exploits Convolutional-Deconvolutional-Convolutional (CDC) filters to upsample the predictions of 3D ConvNets, making it possible to perform per-frame action predictions and achieving promising performance in terms of temporal action localization.
In order to preserve the expected property that end-to-end training is available, we exploit the NSS prior by a set of non-local filters, and derive our proposed trainable non-local reaction diffusion (TNLRD) model for image denoising.
Object detection is an import task of computer vision. A variety of methods have been proposed, but methods using the weak labels still do not have a satisfactory result. In this paper, we propose a new framework that using the weakly supervised method's output as the pseudo-strong labels to train a strongly supervised model. One weakly supervised method is treated as black-box to generate class-specific bounding boxes on train dataset. A de-noise method is then applied to the noisy bounding boxes. Then the de-noised pseudo-strong labels are used to train a strongly object detection network. The whole framework is still weakly supervised because the entire process only uses the image-level labels. The experiment results on PASCAL VOC 2007 prove the validity of our framework, and we get result 43. 4% on mean average precision compared to 39. 5% of the previous best result and 34. 5% of the initial method, respectively. And this frame work is simple and distinct, and is promising to be applied to other method easily.
This study focuses on human recognition with gait feature obtained by Kinect and shows that gait feature can effectively distinguish from different human beings through a novel representation -- relative distance-based gait features.