Most deep metric learning (DML) methods employ a strategy that forces all positive samples to be close in the embedding space while keeping them away from negative ones.
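The pull-together, push-apart strategy described above can be illustrated with a classic pairwise contrastive loss. This is a minimal sketch of one common DML objective, not the method of any particular paper listed here; the function names and the `margin` default are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(anchor, other, is_positive, margin=1.0):
    """Pairwise contrastive loss (illustrative, hypothetical helper).

    Positive pairs are penalized by their squared distance, which pulls
    them together. Negative pairs are penalized only while they are
    closer than `margin`, which pushes them apart up to that margin.
    """
    d = euclidean(anchor, other)
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

A negative pair that is already farther apart than the margin contributes zero loss, so only "hard" negatives drive the embedding.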
Specifically, based on the two discoveries of local spatial similarity and adjacent temporal correspondence of the sequential image data, we propose a novel Target-Domain driven pseudo label Diffusion (TDo-Dif) scheme.
Large variations in viewpoint and irrelevant content around the target always hinder accurate image retrieval and its subsequent tasks.
In this paper, we investigate using rolling shutter with a global reset feature (RSGR) to restore clean global shutter (GS) videos.
OC-cost computes the cost of correcting detections to ground truths as a measure of accuracy.
This framework consists of three key components, i.e., a pseudo-edge generator, a pseudo-map generator, and an uncertainty-aware refinement module.
However, such methods have two main drawbacks, particularly in large-scale applications: (1) the pairwise approach is severely inefficient due to the quadratic computational cost; and (2) even recent model-based samplers (e.g., IRGAN) cannot achieve practical efficiency due to the training of an extra model.
Learning from implicit user feedback is challenging as we can only observe positive samples but never access negative ones.
In this paper, we introduce coherence priors between the semantics and textures which make it possible to concentrate on completing separate textures in a semantic-wise manner.
Solving cold-start problems is indispensable to provide meaningful recommendation results for new users and items.
We formulate the mutual transformations between the outputs of regression- and detection-based models as two scene-agnostic transformers which enable knowledge distillation between the two models.
Completing a corrupted image with correct structures and reasonable textures for a mixed scene remains an elusive challenge.
Unsupervised learning can discover various unseen diseases, relying on large-scale unannotated medical images of healthy subjects.
An efficient and effective person re-identification (ReID) system relieves the users from painful and boring video watching and accelerates the process of video analysis.
To exploit the strengths of deep learning models, we treat a group as multiple persons and transfer the domain of a labeled ReID dataset to the style of a G-ReID target dataset to learn single representations.
To demonstrate the illumination issue and to evaluate our model, we construct two large-scale simulated datasets with a wide range of illumination variations.
Convolutional Neural Network (CNN)-based accurate prediction typically requires large-scale annotated training data.
Accurate Computer-Assisted Diagnosis, associated with proper data wrangling, can alleviate the risk of overlooking the diagnosis in a clinical environment.
Diffusion is commonly used as a ranking or re-ranking method in retrieval tasks to achieve higher retrieval performance, and has attracted considerable attention in recent years.
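To make the idea of diffusion-based re-ranking concrete, the sketch below propagates a query's relevance over a small affinity graph using the standard iteration f ← αSf + (1 − α)y, where S is the symmetrically normalized affinity matrix. It is a generic illustration under assumed names (`diffuse`, `rerank`, `alpha`, `n_iter`), not the specific algorithm of the paper excerpted above.

```python
import math


def diffuse(W, y, alpha=0.5, n_iter=50):
    """Run score diffusion on an affinity graph (illustrative sketch).

    W is a symmetric, non-negative affinity matrix (list of rows) and
    y is the initial score vector. Returns the diffused scores.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}; isolated nodes get 0.
    S = [
        [
            W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    f = list(y)
    for _ in range(n_iter):
        f = [
            alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
            for i in range(n)
        ]
    return f


def rerank(W, query_index, alpha=0.5, n_iter=50):
    """Rank all nodes by their diffused similarity to the query node."""
    y = [0.0] * len(W)
    y[query_index] = 1.0
    f = diffuse(W, y, alpha, n_iter)
    return sorted(range(len(W)), key=lambda i: -f[i])
```

Because scores flow along graph edges, an item that is not directly similar to the query can still rank highly if it is connected to the query through close neighbors.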
Ranked #1 on Image Retrieval on Par6k
Recently, research has started venturing into generating (audio) speech from silent video sequences but there have been no developments thus far in dealing with divergent views and poses of a speaker.
First we show that, by replacing model samples with ground-truth sentences, RL training can be seen as a form of weighted cross-entropy loss, giving a fast, RL-based pre-training algorithm.
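The weighted cross-entropy reading can be sketched as follows. This is a hedged reconstruction from the standard REINFORCE gradient, not the paper's own derivation; the symbols $x$ (input), $\hat{y}$ (model sample), $y^{*}$ (ground-truth sentence), $r(\cdot)$ (reward), and $p_\theta$ (model) are assumed notation.

```latex
% REINFORCE gradient with samples drawn from the model:
\nabla_\theta \mathcal{L}_{\mathrm{RL}}
  = -\,\mathbb{E}_{\hat{y} \sim p_\theta(\cdot \mid x)}
      \left[ r(\hat{y}) \, \nabla_\theta \log p_\theta(\hat{y} \mid x) \right]

% Replacing the model sample \hat{y} with a ground-truth sentence y^{*}:
\nabla_\theta \mathcal{L}_{\mathrm{RL}}
  \approx -\, r(y^{*}) \, \nabla_\theta \log p_\theta(y^{*} \mid x)
  = r(y^{*}) \, \nabla_\theta \mathcal{L}_{\mathrm{XE}}(y^{*}),
\qquad
\mathcal{L}_{\mathrm{XE}}(y^{*}) = -\log p_\theta(y^{*} \mid x)
```

Under these assumptions each ground-truth sentence contributes an ordinary cross-entropy gradient scaled by its reward, and no sampling from the model is needed, which is consistent with the speed-up the sentence describes.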
The proposed method can retrieve and localize objects specified by a textual query from one million images in only 0.5 seconds with high precision.
Second, to help users specify spatial relationships among objects in an intuitive way, we propose recommendation techniques of spatial relationships.
Although convolutional neural networks (CNNs) have achieved promising results in learning such concepts, it remains an open question as to how to effectively use CNNs for abnormal event detection, mainly due to the environment-dependent nature of the anomaly detection.
In this paper, we propose a stand-alone mobile visual search system based on binary features and the bag-of-visual words framework.
Recently, the Fisher vector representation of local features has attracted much attention because of its effectiveness in both image classification and image retrieval.
This work explores the suitability for instance retrieval of image- and region-wise representations pooled from an object detection CNN such as Faster R-CNN.
Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision.
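As an illustration of what "explaining data with as few dictionary bases as possible" means in practice, the sketch below solves the common L1-regularized formulation, minimizing 0.5‖x − Dα‖² + λ‖α‖₁ over the code α, with the iterative shrinkage-thresholding algorithm (ISTA). The L1 objective and ISTA are standard choices assumed here for illustration; the excerpt does not say which formulation or solver the paper uses, and the names `ista`, `lam`, and `step` are hypothetical.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(D, x, lam, step, n_iter=100):
    """Sparse-code x over dictionary D with ISTA (illustrative sketch).

    D is a list of m rows with k columns (one column per dictionary basis),
    x has length m, lam is the L1 weight, and step should be at most
    1 / (largest eigenvalue of D^T D) for the iteration to converge.
    """
    m, k = len(D), len(D[0])
    alpha = [0.0] * k
    for _ in range(n_iter):
        # Reconstruction residual x - D @ alpha.
        residual = [x[i] - sum(D[i][j] * alpha[j] for j in range(k)) for i in range(m)]
        # Negative gradient of the quadratic term: D^T @ residual.
        direction = [sum(D[i][j] * residual[i] for i in range(m)) for j in range(k)]
        # Gradient step followed by soft-thresholding, which induces sparsity.
        alpha = [
            soft_threshold(alpha[j] + step * direction[j], step * lam)
            for j in range(k)
        ]
    return alpha
```

With an identity dictionary and `step=1.0`, the result reduces to soft-thresholding the input, so small coefficients are set exactly to zero while large ones are shrunk by `lam`.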