Drone-based imaging can be applied to dynamic traffic monitoring, object detection and tracking, and other vision tasks.
Moreover, the adjacency matrix can be self-learned for better embedding performance when the original graph structure is incomplete.
Recently, many deep networks have been proposed to improve the quality of predicted super-resolution (SR) images, owing to the widespread use of SR in several image-based fields.
Deep learning-based hyperspectral image super-resolution (SR) methods have achieved great success recently.
In the last decade, crowd counting and localization have attracted much attention from researchers due to their widespread applications, including crowd monitoring, public safety, and space design.
Clustering is an effective technique in data mining to group a set of objects in terms of some attributes.
In particular, MetaL-TDVS aims to uncover the latent mechanism for summarizing video by reformulating video summarization as a meta-learning problem, thereby promoting the generalization ability of the trained model.
This work aims to solve problems involving intractable sparsity-inducing norms, which are often encountered in machine learning tasks such as multi-task learning, subspace clustering, feature selection, and robust principal component analysis.
Exploiting different representations, or views, of the same object for better clustering, conventionally called multi-view clustering, has become very popular.
To make full use of this information, this paper attempts to exploit text-guided attention and semantic-guided attention (SA) to find more strongly correlated spatial information and reduce the semantic gap between vision and language.
Crowd counting from a single image is a challenging task due to high appearance similarity, perspective changes and severe congestion.
In this paper, a novel locality and structure regularized low rank representation (LSLRR) model is proposed for HSI classification.
To better handle the high-dimensionality problem and explore abundance information, this paper presents a General End-to-end Two-dimensional CNN (GETNET) framework for hyperspectral image change detection (HSI-CD).
Band selection, which chooses a set of representative bands in a hyperspectral image (HSI), is an effective method for reducing redundant information without compromising the original content.
Compared to traditional RNNs, H-RNN is better suited to video summarization, since it can exploit long temporal dependencies among frames while significantly reducing the number of computation operations.
This is because the CLF's influence function has an upper bound, which can alleviate the influence of a single sample, especially a sample with large noise, on estimating the residuals.
To address this problem, we make the first attempt to view weather recognition as a multi-label classification task, i.e., assigning an image more than one label according to the displayed weather conditions.
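As a minimal sketch of what a multi-label decision rule can look like, the snippet below scores each label independently with a sigmoid and keeps every label that reaches a threshold, so one image can receive several labels at once. The label names, the logits, and the 0.5 threshold are all assumptions for illustration; the excerpt above does not specify the weather classes or the model used.

```python
import math

# Hypothetical label set; the source excerpt does not list the weather classes.
LABELS = ["sunny", "cloudy", "foggy", "rainy", "snowy"]


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Return every label whose independent sigmoid score reaches the threshold.

    Unlike softmax classification, the labels do not compete with each other,
    so zero, one, or several labels may be returned for the same image.
    """
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


# Made-up raw scores for a single image (not taken from any real model).
example_logits = [2.0, 1.2, -3.0, -0.5, -4.0]
print(predict_labels(example_logits))
```

With these made-up scores the image is tagged as both sunny and cloudy, which a single-label classifier could not express.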
This paper presents new algorithms to solve the feature-sparsity constrained PCA problem (FSPCA), which performs feature selection and PCA simultaneously.
In this paper, we propose a 3-gated model that fuses global and local image features for the task of image caption generation.
In this paper, we propose a weakly supervised adversarial domain adaptation method, consisting of three deep neural networks, to improve segmentation performance from synthetic data to real scenes.
different encoding schemes indicate that using a machine model to accelerate optimization evaluation and reduce experimental cost is feasible to some extent, which could dramatically promote the upgrading of encoding schemes and thereby help blind people improve their visual perception ability.
Moreover, considering the importance of the discriminative information underlying the initial clustering results, we add a discriminative constraint to our proposed objective function.
In this paper, we propose a novel filter pruning scheme, termed structured sparsity regularization (SSR), to simultaneously speed up the computation and reduce the memory overhead of CNNs, which can be well supported by various off-the-shelf deep learning libraries.
Hence, it is highly desirable to learn an effective joint representation by fusing the features of different modalities.
Experimental results on the VOC2007 and VOC2012 datasets demonstrate that the proposed TripleNet is able to improve both the detection and segmentation accuracies without adding extra computational costs.
Ranked #26 on Semantic Segmentation on PASCAL VOC 2012 test
Many classic methods solve the domain adaptation problem by establishing a common latent space, which may cause the loss of many important properties across both domains.
Although video summarization has achieved great success in recent years, few approaches have realized the influence of video structure on the summarization results.
In this paper, we propose a multi-branch and high-level semantic network by gradually splitting a base network into multiple different branches.
Most state-of-the-art scene text detection algorithms are deep-learning-based methods that depend on bounding box regression and perform at least two kinds of predictions: text/non-text classification and location regression.
Ranked #6 on Scene Text Detection on ICDAR 2013
Finally, a comprehensive review is presented on the proposed data set to fully advance the task of remote sensing captioning.
This paper addresses the problem of supervised video summarization by formulating it as a sequence-to-sequence learning problem, where the input is a sequence of original video frames and the output is a keyshot sequence.
Ranked #4 on Video Summarization on TvSum
Based on the analysis, we provide a so-called Deep Binary Reconstruction (DBRC) network that can directly learn the binary hashing codes in an unsupervised fashion.
In this paper, we propose a generative approach, referred to as multi-modal stochastic RNNs networks (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables.
Given the explosive growth of online videos, it is becoming increasingly important to relieve the tedious work of browsing and managing the video content of interest.
Based on the new Cohesion Measurement, a novel object discovery method is proposed to discover objects latent in an image by utilizing the eigenvectors of the affinity matrix.
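To illustrate in general terms how eigenvectors of an affinity matrix can expose a group of mutually similar elements, here is a small sketch that uses power iteration to find the leading eigenvector and keeps the entries with large components. This is a generic dominant-eigenvector grouping example, not the Cohesion Measurement method itself; the affinity values, the 0.3 threshold, and the function names are assumptions made for illustration.

```python
import math


def leading_eigenvector(matrix, iterations=200):
    """Approximate the leading eigenvector of a symmetric, nonnegative
    matrix by power iteration, starting from a uniform vector."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def dominant_group(affinity, threshold=0.3):
    """Return the indices whose leading-eigenvector component reaches the
    threshold (the threshold value is an illustrative assumption)."""
    v = leading_eigenvector(affinity)
    return [i for i, x in enumerate(v) if x >= threshold]


# Made-up affinities: elements 0-2 are strongly similar to each other,
# elements 3-4 form a weaker pair, and cross-group affinity is small.
AFFINITY = [
    [1.0, 0.9, 0.9, 0.05, 0.05],
    [0.9, 1.0, 0.9, 0.05, 0.05],
    [0.9, 0.9, 1.0, 0.05, 0.05],
    [0.05, 0.05, 0.05, 1.0, 0.8],
    [0.05, 0.05, 0.05, 0.8, 1.0],
]

print(dominant_group(AFFINITY))  # prints [0, 1, 2]
```

The leading eigenvector concentrates its mass on the most cohesive group, which is why thresholding it recovers elements 0 to 2 in this toy case.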
To achieve this goal, we cast the problem into a constrained rank minimization framework by adopting the least squares regularization.
Recently, audiovisual speech recognition based on the MRBM has attracted much attention, and the MRBM has shown its effectiveness in learning the joint representation across audiovisual modalities.
Co-saliency detection is a newly emerging and rapidly growing research area in the computer vision community.
For example, a CNN classifies these proposals using fully connected layer features, while proposal scores and the features in the inner layers of the CNN are ignored.
Ranked #23 on Pedestrian Detection on Caltech
The main purpose of this article is to provide a comprehensive study and an updated review of sparse representation and to offer guidance for researchers.
Finally, we propose to combine both non-neighboring and neighboring features for pedestrian detection.
Ranked #26 on Pedestrian Detection on Caltech
Our DISC framework is capable of uniformly highlighting the objects of interest against complex backgrounds while preserving object details well.
As a result, many moving object detection algorithms produce undesirable false alarms and missed alarms.
Multistage particle windows (MPW), proposed by Gualdi et al., is a fast and accurate object detection algorithm.
iCascade searches for the optimal number r_i of weak classifiers in each stage i by directly minimizing the computation cost of the cascade.
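One common way to reason about the computation cost of a cascade is the expected number of weak-classifier evaluations per window: stage i costs r_i evaluations, but only for windows that passed all earlier stages. The sketch below evaluates that quantity for two candidate configurations. The cost model, the pass rates, and the candidate stage sizes are assumptions for illustration; the excerpt above does not give iCascade's exact objective or search procedure.

```python
def expected_cascade_cost(weak_counts, pass_rates):
    """Expected weak-classifier evaluations per window.

    weak_counts[i] is r_i, the number of weak classifiers in stage i.
    pass_rates[i] is the assumed fraction of windows that stage i lets through.
    """
    if len(weak_counts) != len(pass_rates):
        raise ValueError("one pass rate is needed per stage")
    cost = 0.0
    reach = 1.0  # fraction of windows that reach the current stage
    for r_i, p_i in zip(weak_counts, pass_rates):
        cost += reach * r_i
        reach *= p_i
    return cost


# Two made-up candidate cascades (stage sizes and pass rates are hypothetical).
candidates = {
    "small first stage": ([2, 10, 50], [0.5, 0.1, 0.01]),
    "large first stage": ([10, 52], [0.2, 0.01]),
}

for name, (counts, rates) in candidates.items():
    print(name, expected_cascade_cost(counts, rates))

best = min(candidates, key=lambda k: expected_cascade_cost(*candidates[k]))
print("cheapest:", best)
```

Under these assumed numbers, the cascade with a cheap first stage wins because most windows are rejected before the expensive stages run. In practice the pass rate of a stage depends on how many weak classifiers it contains, which this sketch deliberately ignores.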
In this paper, we propose a new approach to overcome the representation and matching problems in age-invariant face recognition.
CLM-based methods consist of a shape model and a number of local experts, each of which is utilized to detect a facial feature point.