Learning from the web can ease the extreme dependence of deep learning on large-scale manually labeled datasets.
However, deep hashing networks are vulnerable to adversarial examples, which poses a practical security problem that has seldom been studied in the hashing-based retrieval field.
To further mine objects in non-salient regions, we propose to exploit the segmentation network's self-correction ability.
Due to the memorization effect in Deep Neural Networks (DNNs), training with noisy labels usually results in inferior model performance.
Then we utilize the fused prototype to guide the final segmentation of the query image.
Labeling objects at a subordinate level typically requires expert knowledge, which is not always available when using random annotators.
We propose to incorporate the prior about the co-occurrence of relation pairs into the graph to further help alleviate the class imbalance issue.
One bottleneck (i.e., binary codes) conveys the high-level intrinsic data structure captured by the code-driven graph to the other (i.e., continuous variables for low-level detail information), which in turn propagates the updated network feedback for the encoder to learn more discriminative binary codes.
Binary optimization, a representative subclass of discrete optimization, plays an important role in mathematical optimization and has various applications in computer vision and machine learning.
It extracts the complementary information of different modalities through a connection block, which aims at exploring correlations among different stream features.
Ranked #10 on Action Recognition on HMDB-51 (using extra training data)
Since deep networks are capable of memorizing the entire dataset, the corrupted samples generated by vanilla MixUp with a badly chosen interpolation policy will degrade the performance of networks.
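To make the interpolation policy concrete, here is a minimal MixUp sketch; the `alpha` hyperparameter, the one-hot label format, and the function name are illustrative assumptions rather than details taken from the paper:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix two samples and their one-hot labels with a Beta-sampled weight.

    A badly chosen alpha (e.g. very large) pushes lam toward 0.5 and can
    produce heavily corrupted samples, which is the failure mode the
    sentence above describes.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    x = lam * x1 + (1 - lam) * x2         # convex combination of inputs
    y = lam * y1 + (1 - lam) * y2         # same combination of labels
    return x, y
```

The mixed label stays a valid probability vector because the weights sum to one.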
In this paper, we propose an efficient temporal reasoning graph (TRG) to simultaneously capture the appearance features and temporal relation between video sequences at multiple time scales.
Ranked #37 on Action Recognition on Something-Something V1
Recent studies have shown remarkable success in face manipulation tasks with the advance of GAN and VAE paradigms, but the outputs are sometimes limited to low resolution and lack diversity.
Recent successes in visual recognition can be primarily attributed to feature representation, learning algorithms, and the ever-increasing size of labeled training data.
To address this issue, we present an adaptive multi-model framework that resolves polysemy by visual disambiguation.
Due to the sequential dependencies among words in a caption, we formulate the generation of adversarial noises for targeted partial captions as a structured output learning problem with latent variables.
Besides, we propose a View-guided Skeleton CNN (VS-CNN) to tackle the problem of arbitrary-view action recognition.
Hashing techniques are in great demand for a wide range of real-world applications such as image retrieval and network compression.
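To illustrate the retrieval use case, a minimal sketch of Hamming-distance ranking over binary codes follows; the 0/1 code format and the function name are illustrative assumptions, not any particular paper's method:

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to a query's binary code.

    query_code: (L,) array of {0, 1}; db_codes: (N, L) array of {0, 1}.
    Returns the database indices sorted by increasing distance, plus the
    sorted distances. Comparing short codes like this is why hashing
    scales to very large image databases.
    """
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    order = np.argsort(dists, kind="stable")  # stable: ties keep db order
    return order, dists[order]
```

In practice the codes are packed into machine words and compared with XOR/popcount, but the ranking logic is the same.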
How to economically cluster large-scale multi-view images is a long-standing problem in computer vision.
Despite the remarkable success of Convolutional Neural Networks (CNNs) on generalized visual tasks, high computational and memory costs restrict their comprehensive applications on consumer electronics (e.g., portable or smart wearable devices).
The generative model learns a mapping such that the distribution of sketches becomes indistinguishable from the distribution of natural images under an adversarial loss, and simultaneously learns an inverse mapping based on a cycle-consistency loss to further enhance this indistinguishability.
To advance subtle expression recognition, we contribute a Large-scale Subtle Emotions and Mental States in the Wild database (LSEMSW).
As an important part of ZSIH, we formulate a generative hashing scheme that reconstructs semantic knowledge representations for zero-shot retrieval.
While each view of the stereoscopic pair is processed in an individual path, a novel feature aggregation strategy is proposed to effectively share information between the two paths.
We propose a novel image classification method based on learning hierarchical inter-class structures.
To eliminate manual annotation, in this work, we propose a novel image dataset construction framework by employing multiple textual queries.
Extensive experiments on four realistic action datasets in terms of three tasks (i.e., partial action retrieval, recognition and prediction) clearly show the superiority of PRBC over the state-of-the-art methods, along with significantly reduced memory load and computational costs during the online test.
Our ZSECOC equips the conventional ECOC with the additional capability of ZSAR, by addressing the domain shift problem.
Ranked #2 on Zero-Shot Action Recognition on Olympics
By additionally introducing manifold regularizations on visual data and semantic embeddings, the learned projection can effectively capture the geometrical manifold structure residing in both the visual and semantic spaces.
Using the proposed Unseen Visual Data Synthesis (UVDS) algorithm, semantic attributes are effectively utilised as an intermediate clue to synthesise unseen visual features at the training stage.
To tackle these problems, in this work, we exploit general corpus information to automatically select and subsequently classify web images into semantically rich (sub-)categories.
Free-hand sketch-based image retrieval (SBIR) is a specific cross-view retrieval task, in which queries are abstract and ambiguous sketches while the retrieval database is formed with natural images.
Along with the prosperity of recurrent neural networks in modelling sequential data and the power of attention mechanisms in automatically identifying salient information, image captioning, a.k.a. image description, has been remarkably advanced in recent years.
The first approach, termed Inner-product Binary Coding (IBC), preserves the inner relationships of images and videos in a common Hamming space.
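The connection between inner products and the Hamming space can be made explicit: for ±1 codes of length L, the code inner product and the Hamming distance satisfy ⟨b1, b2⟩ = L − 2·d_H(b1, b2), so fitting inner products is equivalent to controlling Hamming distances. A small illustrative check (not the paper's implementation):

```python
import numpy as np

def hamming(b1, b2):
    """Hamming distance between two ±1 codes of equal length."""
    return int(np.count_nonzero(b1 != b2))

b1 = np.array([1, -1, 1, 1])
b2 = np.array([1, 1, -1, 1])
L = len(b1)
# The identity <b1, b2> = L - 2 * d_H(b1, b2) holds for any ±1 codes:
# matching bits contribute +1 each, differing bits contribute -1 each.
assert int(b1 @ b2) == L - 2 * hamming(b1, b2)
```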
To reduce the cost of manual labelling, there has been increased research interest in automatically constructing image datasets by exploiting web images.
Supervised knowledge (e.g., semantic labels or pairwise relationships) associated with data can significantly improve the quality of hash codes and hash functions.
Video captioning has been attracting broad research attention in the multimedia community.
The optimization proceeds alternately over the binary classifiers and the image hash codes.
Inspired by the latest advance in asymmetric hashing schemes, we propose an asymmetric binary code learning framework based on inner product fitting.
In addition, a supervised inductive manifold hashing framework is developed by incorporating the label information, which is shown to greatly advance the semantic retrieval performance.
We propose a very simple, efficient yet surprisingly effective feature extraction method for face recognition (about 20 lines of Matlab code), which is mainly inspired by spatial pyramid pooling in generic image classification.
Spatial pyramid pooling of features encoded by an over-complete dictionary has been the key component of many state-of-the-art image classification systems.
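As a rough sketch of the spatial pyramid pooling step, the following pools a feature map over a pyramid of grids and concatenates the results into one fixed-length vector; the grid levels, max-pooling choice, and function name are illustrative assumptions, and the actual pipelines also involve dictionary learning and feature encoding:

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a (H, W, C) feature map over 1x1, 2x2 and 4x4 grids and
    concatenate the pooled responses into one fixed-length vector, so
    inputs of different sizes map to the same output dimensionality."""
    h, w, c = fmap.shape
    pooled = []
    for n in levels:
        # split the map into an n-by-n grid of (possibly uneven) cells
        for hs in np.array_split(np.arange(h), n):
            for ws in np.array_split(np.arange(w), n):
                cell = fmap[hs[0]:hs[-1] + 1, ws[0]:ws[-1] + 1, :]
                pooled.append(cell.max(axis=(0, 1)))  # per-channel max
    return np.concatenate(pooled)  # length = C * sum(n * n for n in levels)
```

The coarsest 1×1 level captures global statistics while the finer grids preserve rough spatial layout, which is what makes the pooled vector discriminative.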
The main finding of this work is that the standard image classification pipeline, which consists of dictionary learning, feature encoding, spatial pyramid pooling and linear classification, outperforms all state-of-the-art face recognition methods on the tested benchmark datasets (we have tested on AR, Extended Yale B, the challenging FERET, and LFW-a datasets).
Minimization of the $L_\infty$ norm, which can be viewed as approximately solving the non-convex least median estimation problem, is a powerful method for outlier removal and hence robust regression.
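Minimizing the $L_\infty$ norm of the residuals can be cast as a small linear program over the parameters plus one slack variable bounding the largest absolute residual. A hedged sketch using SciPy's `linprog` (the helper name and problem setup are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def linf_fit(A, b):
    """Chebyshev (L-infinity) regression: min_x max_i |a_i @ x - b_i|.

    Introduce t >= |a_i @ x - b_i| for all i and minimise t, which gives
    the linear program  min t  s.t.  Ax - b <= t*1  and  -(Ax - b) <= t*1.
    """
    m, n = A.shape
    c = np.zeros(n + 1)
    c[-1] = 1.0                                # objective: minimise t
    ones = np.ones((m, 1))
    A_ub = np.vstack([np.hstack([A, -ones]),   #  Ax - b <= t
                      np.hstack([-A, -ones])]) # -Ax + b <= t
    b_ub = np.concatenate([b, -b])
    # free variables (linprog defaults to x >= 0, which we do not want)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
    return res.x[:n], res.x[-1]                # estimate and residual bound
```

On consistent data the optimal t is zero; the outlier-removal schemes referenced above exploit the fact that, at the optimum, the maximum residual is attained by a small support set of measurements.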