In this work, we propose a novel Multi-modal Aggregation Network, named MANet, which is capable of discovering complementary representations from a fully sampled auxiliary modality, with which to hierarchically guide the reconstruction of a given target modality.
Existing methods for speech separation either transform the speech signals into the frequency domain to perform separation, or seek to learn a separable embedding space by constructing a latent domain based on convolutional filters.
Bimodal palmprint recognition leverages palmprint and palm vein images simultaneously; it achieves high accuracy through multi-modal information fusion and has a strong anti-falsification property.
Most existing methods for image inpainting focus on learning the intra-image priors from the known regions of the current input image to infer the content of the corrupted regions in the same image.
With the development of deep encoder-decoder architectures and large-scale annotated medical datasets, great progress has been achieved in automatic medical image segmentation.
Hybrid automatic speech recognition (ASR) models are typically sequence-trained with CTC or LF-MMI criteria.
Since no human-labeled samples are required for the target set, unsupervised person re-identification (Re-ID), which additionally exploits the source set, has attracted much attention in recent years.
In this paper, we propose an asymmetric CNN (ACNet) comprising an asymmetric block (AB), a memory enhancement block (MEB) and a high-frequency feature enhancement block (HFFEB) for image super-resolution.
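The key property behind asymmetric blocks of this kind is that parallel 3x3, 1x3, and 3x1 convolution branches whose outputs are summed are mathematically equivalent to a single 3x3 convolution with a fused kernel. The following is a minimal numpy sketch of that equivalence, not the paper's ACNet implementation; all names are illustrative.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Plain 2-D valid cross-correlation, enough to show kernel fusion."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
k33 = rng.standard_normal((3, 3))
k13 = rng.standard_normal((1, 3))
k31 = rng.standard_normal((3, 1))

# Embed the 1-D kernels in the centre row/column of a 3x3 grid so all
# three branches produce outputs of the same size.
k13_pad = np.zeros((3, 3)); k13_pad[1, :] = k13[0]
k31_pad = np.zeros((3, 3)); k31_pad[:, 1] = k31[:, 0]

# Sum of the three parallel branches...
branch_sum = (conv2d_valid(img, k33)
              + conv2d_valid(img, k13_pad)
              + conv2d_valid(img, k31_pad))
# ...equals one convolution with the fused kernel (convolution is
# linear in the kernel).
fused = conv2d_valid(img, k33 + k13_pad + k31_pad)
assert np.allclose(branch_sum, fused)
```

This is why asymmetric branches can be trained separately and then collapsed into a single standard convolution at inference time, at no extra cost.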
To the best of our knowledge, it is the largest contactless palmprint image benchmark ever collected in terms of the number of individuals and palms.
In this work we present the Deep-Masking Generative Network (DMGN), which is a unified framework for background restoration from the superimposed images and is able to cope with different types of noise.
The enhancement block gathers and fuses the global and local features to provide complementary information for the latter network.
In this paper, we discuss the techniques for applying EBR to a Facebook Search system.
To address this issue, we propose a non-local operation for context modeling by employing the global similarity within the context.
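A non-local operation of this flavor can be sketched as a similarity-weighted aggregation over all positions. The snippet below is a simplified numpy illustration of the general non-local idea (softmax over pairwise feature similarities), not the specific operation proposed in the paper; names and shapes are illustrative.

```python
import numpy as np

def non_local(x):
    """Aggregate each position's feature as a softmax-weighted average
    over all positions, using global pairwise similarity."""
    c = x.shape[1]
    logits = x @ x.T / np.sqrt(c)                 # pairwise similarity
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)             # softmax over positions
    return w @ x, w

rng = np.random.default_rng(1)
feats = rng.standard_normal((6, 4))               # 6 positions, 4 channels
out, weights = non_local(feats)
assert out.shape == feats.shape
assert np.allclose(weights.sum(axis=1), 1.0)      # each row is a distribution
```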
Deep learning-based models have achieved state-of-the-art results in many computer vision, speech recognition, and natural language processing tasks in the last few years.
Quantization for deep neural networks has afforded models for edge devices that use less on-board memory and enable efficient low-power inference.
In this paper, we propose a novel Non-negative Sparse and Collaborative Representation (NSCR) for pattern classification.
For the former, we directly apply a CNN to the binarized representation of an image to compute the Bernoulli distribution of each code for entropy estimation.
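The entropy-estimation step can be illustrated independently of the network: given a predicted Bernoulli probability for each binary code, the ideal arithmetic-coding length is the negative log-likelihood in bits. A minimal sketch with made-up probabilities follows.

```python
import math

def code_length_bits(bits, probs):
    """Ideal code length of a binary string where probs[i] is the
    predicted P(bits[i] == 1)."""
    total = 0.0
    for b, p in zip(bits, probs):
        total += -math.log2(p if b == 1 else 1.0 - p)
    return total

# An uninformative model (p = 0.5 everywhere) pays exactly one bit per
# symbol; a model that predicts the codes well pays less.
uniform = code_length_bits([1, 0, 1, 1], [0.5] * 4)
skewed = code_length_bits([1, 0, 1, 1], [0.9, 0.1, 0.9, 0.9])
assert abs(uniform - 4.0) < 1e-12
assert skewed < uniform
```

Minimizing this quantity over the predicted distributions is what makes the entropy model trainable end to end.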
Therefore, in this paper we investigate the feasibility of removing the cosine window from CF trackers with spatial regularization.
The merits of the proposed MCTL are four-fold: 1) the concept of a manifold criterion (MC) is proposed for the first time as a measure validating the distribution matching across domains, and domain adaptation is achieved when the MC is satisfied; 2) the proposed MC can well guide the generation of an intermediate domain sharing a similar distribution with the target domain, by minimizing the local domain discrepancy; 3) a global generative discrepancy metric (GGDM) is presented, such that both the global and local discrepancies can be effectively and positively reduced; 4) a simplified version of MCTL, called MCTL-S, is presented under a perfect domain generation assumption for a more generic learning scenario.
However, the negative entries in the coefficient matrix are forced to be positive when constructing the affinity matrix via exponentiation, absolute symmetrization, or squaring operations.
Most existing image denoising methods assume the corrupting noise to be additive white Gaussian noise (AWGN).
The use of sparse representation (SR) and collaborative representation (CR) for pattern classification has been widely studied in tasks such as face recognition and object categorization.
State-of-the-art tone mapping algorithms mostly decompose an image into a base layer and a detail layer, and process them accordingly.
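The base/detail decomposition can be demonstrated with any smoothing filter: the base layer is a blurred copy of the image, the detail layer is the residual, and only the base layer is range-compressed. The sketch below uses a simple box filter as a stand-in for the edge-aware filters real tone mappers use; it is illustrative only.

```python
import numpy as np

def box_blur(img, radius=1):
    """Mean filter with edge replication; a stand-in for edge-aware
    base-layer filters such as the bilateral filter."""
    padded = np.pad(img, radius, mode='edge')
    out = np.zeros_like(img, dtype=float)
    k = 2 * radius + 1
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

rng = np.random.default_rng(2)
hdr = rng.random((8, 8)) * 100.0     # stand-in HDR luminance values
base = box_blur(hdr)                 # low-frequency layer
detail = hdr - base                  # high-frequency layer
# Compress only the base layer, then recombine so local detail survives.
tone_mapped = 0.3 * base + detail
assert np.allclose(base + detail, hdr)   # decomposition is lossless
```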
For blind deconvolution, since the estimated blur kernel usually contains errors, the subsequent non-blind deconvolution process often fails to restore the latent image well.
To promote the study of this problem and to complement the existing real-world image denoising datasets, we construct a new benchmark dataset containing comprehensive real-world noisy images of different natural scenes.
One key issue of arithmetic encoding is predicting the probability of the current coding symbol from its context, i.e., the preceding encoded symbols, which is usually done by building a look-up table (LUT).
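A context-based LUT of this kind can be sketched in a few lines: the table maps a tuple of preceding symbols to frequency counts, from which the probability of the next symbol is read off. This is a generic illustration with Laplace smoothing, not the paper's scheme; all names are made up.

```python
from collections import defaultdict

class ContextLUT:
    """Look-up table from a context of preceding binary symbols to
    smoothed frequency counts, used to predict the next symbol."""
    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(lambda: [1, 1])  # Laplace smoothing

    def prob_one(self, context):
        c0, c1 = self.counts[tuple(context[-self.order:])]
        return c1 / (c0 + c1)

    def update(self, context, symbol):
        self.counts[tuple(context[-self.order:])][symbol] += 1

lut = ContextLUT(order=2)
stream = [0, 1, 0, 1, 0, 1, 0, 1]
for i in range(2, len(stream)):
    lut.update(stream[i - 2:i], stream[i])

# After seeing "0,1 -> 0" and "1,0 -> 1" repeatedly, the table is
# confident about both contexts.
assert lut.prob_one([1, 0]) > 0.5
assert lut.prob_one([0, 1]) < 0.5
```

Replacing the fixed-order table with a learned predictor is exactly where CNN-based context models improve on the LUT: they can exploit much larger contexts than a table can enumerate.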
Aspect ratio variation frequently appears in visual tracking and severely affects performance.
Most existing denoising algorithms are developed for grayscale images, and it is non-trivial to extend them to color image denoising because the noise statistics in the R, G, and B channels can be very different for real noisy images.
We propose to exploit the information in both external data and the given noisy image, and develop an external prior guided internal prior learning method for real-world noisy image denoising.
Therefore, the encoder, decoder, binarizer and importance map can be jointly optimized in an end-to-end manner by using a subset of the ImageNet database.
In general, our model consists of a mask network and an attribute transform network which work in synergy to generate a photo-realistic facial image with the reference attribute.
Here we address this problem from the viewpoint of optimization, and suggest an optimization model to generate a human face with the given attributes while preserving the identity of the reference image.
Person re-identification has usually been solved as either the matching of single-image representation (SIR) or the classification of cross-image representation (CIR).
It has been shown that tongue, face, and sublingual diagnosis, as noninvasive methods, offer a reasonable way for disease detection.
This paper presents a novel quadratic projection based feature extraction framework, where a set of quadratic matrices is learned to distinguish each class from all other classes.
In this paper, we focus on the problem of instrumental variation and time-varying drift in the field of sensors and measurement, which can be viewed as discrete and continuous distributional change in the feature space.
The main purpose of this article is to provide a comprehensive study and an updated review of sparse representation and to supply guidance for researchers.
For character detection, we use HSC features instead of Histograms of Oriented Gradients (HOG) features.
PGs are extracted from training images by putting nonlocal similar patches into groups, and a PG based Gaussian Mixture Model (PG-GMM) learning algorithm is developed to learn the NSS prior.
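The patch-group (PG) extraction step can be illustrated directly: for a reference patch, collect the k patches in the image closest to it in Euclidean distance. The following numpy sketch shows only this grouping step, not the PG-GMM learning; parameter values are illustrative.

```python
import numpy as np

def extract_patch_group(img, ref, patch=4, k=5, stride=1):
    """Collect the k patches most similar (L2) to the reference patch,
    forming one nonlocal patch group."""
    ri, rj = ref
    target = img[ri:ri + patch, rj:rj + patch].ravel()
    cands = []
    for i in range(0, img.shape[0] - patch + 1, stride):
        for j in range(0, img.shape[1] - patch + 1, stride):
            p = img[i:i + patch, j:j + patch].ravel()
            cands.append((np.sum((p - target) ** 2), p))
    cands.sort(key=lambda t: t[0])           # nearest patches first
    return np.stack([p for _, p in cands[:k]])

rng = np.random.default_rng(3)
image = rng.random((16, 16))
group = extract_patch_group(image, ref=(4, 4))
assert group.shape == (5, 16)                # k patches of 4x4 = 16 pixels
# The reference patch itself is the best match (distance zero).
assert np.allclose(group[0], image[4:8, 4:8].ravel())
```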
In this report, we discuss nearest neighbor, support vector machine, and extreme learning machine classifiers for image classification on deep convolutional activation feature representations.
This paper proposes a unified framework, referred to as Domain Adaptation Extreme Learning Machine (DAELM), which learns a robust classifier by leveraging a limited number of labeled data from target domain for drift compensation as well as gases recognition in E-nose systems, without loss of the computational efficiency and learning ability of traditional ELM.
This paper studies visual understanding via a newly proposed l_2-norm based multi-feature shared learning framework, which can simultaneously learn a global label matrix and multiple sub-classifiers with the labeled multi-feature data.
It allows us to learn a category transformation and an ELM classifier with random projection by minimizing the l_(2,1)-norm of the network output weights and the learning error simultaneously.
Conventional extreme learning machines solve a Moore-Penrose generalized inverse of the hidden-layer activation matrix and analytically determine the output weights to achieve good generalization performance, by assuming the same loss for different types of misclassification.
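The single analytic training step of a conventional ELM can be sketched in a few lines of numpy: fix a random hidden layer, then solve for the output weights with the pseudoinverse. This is a generic regression sketch with made-up sizes, not a specific paper's model.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, hidden = 10, 3, 50

X = rng.standard_normal((n, d))
T = rng.standard_normal((n, 2))            # regression targets

# Random, untrained hidden layer: the defining trait of an ELM.
W = rng.standard_normal((d, hidden))
b = rng.standard_normal(hidden)
H = np.tanh(X @ W + b)                     # hidden-layer activation matrix

# Output weights via the Moore-Penrose pseudoinverse -- the single
# analytic step that replaces iterative training.
beta = np.linalg.pinv(H) @ T

# With more hidden units than samples, the fit is (numerically) exact.
assert np.allclose(H @ beta, T, atol=1e-6)
```

The pseudoinverse gives the minimum-norm least-squares solution, which is why ELM training is a one-shot linear solve rather than gradient descent.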
Distance metric learning aims to learn from the given training data a valid distance metric, with which the similarity between data samples can be more effectively evaluated for classification.
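The "valid distance metric" being learned is typically a Mahalanobis-type distance parameterized by a positive semidefinite matrix M. A minimal numpy sketch of evaluating such a metric (with illustrative values, not a learned M) follows.

```python
import numpy as np

def mahalanobis(x, y, M):
    """Distance under a metric matrix M; M must be positive
    semidefinite for this to be a valid distance."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# With M = I the learned metric reduces to plain Euclidean distance.
assert abs(mahalanobis(x, y, np.eye(2)) - 5.0) < 1e-12

# A metric that down-weights the first coordinate shrinks the distance,
# which is how learning M can emphasize discriminative dimensions.
M = np.diag([0.01, 1.0])
assert mahalanobis(x, y, M) < 5.0
```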
In this paper, we present a simple yet fast and robust algorithm which exploits the spatio-temporal context for visual tracking.
Learning a distance metric from the given training samples plays a crucial role in many machine learning tasks, and various models and optimization algorithms have been proposed in the past decade.
One key issue of image set based face recognition (ISFR) is how to effectively and efficiently represent the query face image set using the gallery face image sets.
Image denoising is a classical yet fundamental problem in low level vision, as well as an ideal test bed to evaluate various statistical image modeling methods.
The means of the Gaussian distributions in the transformed domain can be adaptively estimated by multiplying a bias field with the original signal within the window.
It is widely believed that the l_1-norm sparsity constraint on coding coefficients plays a key role in the success of SRC, while its use of all training samples to collaboratively represent the query sample is rather overlooked.
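The collaborative-representation view can be made concrete: represent the query over the whole gallery with regularized least squares (a closed form, no l_1 solver), then classify by class-wise reconstruction residual. The following is a toy numpy sketch with hand-built data, illustrating the mechanism rather than any specific paper's classifier.

```python
import numpy as np

# Tiny gallery: columns are training samples, two atoms per class.
A = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.0, 0.1, 1.0, 0.9],
              [0.0, 0.0, 0.0, 0.1]])
labels = np.array([0, 0, 1, 1])
y = np.array([0.95, 0.05, 0.0])            # query, close to class 0

# Collaborative representation: ridge regression over the whole gallery.
lam = 0.01
x = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

# Classify by which class's coefficients reconstruct the query best.
residuals = []
for c in (0, 1):
    xc = np.where(labels == c, x, 0.0)     # keep only class-c coefficients
    residuals.append(np.linalg.norm(y - A @ xc))
assert int(np.argmin(residuals)) == 0      # query assigned to class 0
```

The closed-form solve is the point: the collaborative (all-samples) mechanism alone already yields a working classifier, without the cost of l_1 optimization.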
Recently the sparse representation based classification (SRC) has been proposed for robust face recognition (FR).
In this paper, the method of kernel Fisher discriminant (KFD) is analyzed and its nature is revealed, i.e., KFD is equivalent to kernel principal component analysis (KPCA) plus Fisher linear discriminant analysis (LDA).