In this work, we propose two simple yet effective texture randomization mechanisms, Global Texture Randomization (GTR) and Local Texture Randomization (LTR), for Domain Generalization based SRSS.
Weakly-supervised anomaly detection aims at learning an anomaly detector from a limited amount of labeled data and abundant unlabeled data.
In this paper, we propose a new loss based on center predictivity: a sample must be positioned in the feature space such that its location roughly predicts the center of same-class samples.
Given a query patch from a novel class, one-shot object detection aims to detect all instances of that class in a target image through semantic similarity comparison.
Learning cross-view consistent feature representation is the key for accurate vehicle Re-identification (ReID), since the visual appearance of vehicles changes significantly under different viewpoints.
Fashion products typically combine a variety of styles across different clothing parts.
Incorporating knowledge bases (KB) into end-to-end task-oriented dialogue systems is challenging, since it requires properly representing KB entities, which are associated with both the KB context and the dialogue context.
Most existing crowd counting systems rely on the availability of object location annotations, which can be expensive to obtain.
Most existing crowd counting methods require object location-level annotation, i.e., placing a dot at the center of each object.
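A common way to turn such dot annotations into a trainable regression target, widely used in the crowd counting literature though not specific to any one method above, is to place a normalized Gaussian kernel at each annotated point so that the resulting density map sums to the object count. A minimal NumPy sketch, with the function name and parameter values chosen for illustration:

```python
import numpy as np

def dots_to_density(points, height, width, sigma=4.0, radius=12):
    """Convert dot annotations (one (row, col) per object) into a
    density map whose integral equals the number of objects."""
    density = np.zeros((height, width))
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    kernel = np.exp(-(xs**2 + ys**2) / (2 * sigma**2))
    kernel /= kernel.sum()  # normalize so each dot contributes exactly 1
    for r, c in points:
        r, c = int(r), int(c)
        # Clip the stamp region at the image border.
        r0, r1 = max(r - radius, 0), min(r + radius + 1, height)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, width)
        density[r0:r1, c0:c1] += kernel[r0 - r + radius:r1 - r + radius,
                                        c0 - c + radius:c1 - c + radius]
    return density

# Three dots -> the map sums to (approximately) 3.
dmap = dots_to_density([(20, 30), (40, 50), (42, 48)], 64, 64)
```

A counting network can then regress this map and report the predicted count as the sum over its output.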
Hyperspectral image (HSI) classification has been improved with convolutional neural networks (CNNs) in recent years.
Our insight is that the prediction target in SemSL can be modeled as the latent factor in the predictor for the SlfSL target.
Such an imbalanced distribution poses a great challenge for learning a deep neural network, which can be boiled down to a dilemma: on the one hand, we prefer to increase the exposure of tail class samples to avoid the excessive dominance of head classes in the classifier training.
To account for this style shift, the model should adjust its parameters in accordance with entity types.
Thus, we propose to create auxiliary fact representations from charge definitions to augment the representation of fact descriptions.
This paper tackles the problem of video object segmentation.
Experiments on classification, semantic segmentation, and object detection tasks demonstrate the superior performance of the proposed methods over various quantized networks in the literature.
By adding a common module, a video loss, which we formulate with various forms of constraints (including a weighted BCE loss, a high-dimensional triplet loss, and a novel mixed instance-aware video loss), to the training of the parent network in step (2), the network is better prepared for step (3), i.e., online fine-tuning on the target instance.
Furthermore, we propose a second progressive quantization scheme which gradually decreases the bit-width from high-precision to low-precision during training.
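The idea of gradually decreasing the bit-width can be sketched with a standard uniform quantizer; the function name, value range, and schedule below are illustrative assumptions rather than the authors' exact formulation:

```python
import numpy as np

def uniform_quantize(w, bits):
    """Uniformly quantize values in [-1, 1] onto 2**bits - 1 intervals."""
    levels = 2 ** bits - 1
    w = np.clip(w, -1.0, 1.0)
    # Map to [0, levels], round to the nearest level, and map back.
    return np.round((w + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0

weights = np.linspace(-1.0, 1.0, 9)
# Progressive schedule: start near full precision, end at the target width.
for bits in (8, 4, 2):
    q = uniform_quantize(weights, bits)
    # In training, q would replace the full-precision weights for the
    # forward pass at this stage; the quantization error grows as the
    # bit-width shrinks, which is why the decrease is gradual.
```

The benefit of such a schedule is that the network adapts to mild quantization noise before facing the severe noise of the final low bit-width.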
One of the primary challenges faced by deep learning is the degree to which current methods exploit superficial statistics and dataset bias, rather than learning to generalise over the specific representations they have experienced.
At the test stage, the learned memory will be fixed, and the reconstruction is obtained from a few selected memory records of the normal data.
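One common realization of this idea (a sketch under assumptions, not necessarily the authors' exact mechanism) is to reconstruct a query feature as a similarity-weighted combination of its few nearest records in a fixed memory bank; anomalous inputs then reconstruct poorly, which provides the detection signal. All names below are hypothetical:

```python
import numpy as np

def reconstruct_from_memory(z, memory, k=3):
    """Reconstruct query feature z from its k most similar memory records.

    memory: (M, D) fixed bank of normal-data prototypes.
    Restricting the reconstruction to a few records keeps the output
    close to the normal-data manifold.
    """
    # Cosine similarity between z and every memory record.
    sims = memory @ z / (np.linalg.norm(memory, axis=1)
                         * np.linalg.norm(z) + 1e-8)
    top = np.argsort(sims)[::-1][:k]
    w = np.exp(sims[top])
    w /= w.sum()  # softmax weights over the selected records
    return w @ memory[top]

memory = np.eye(4)  # hypothetical bank of 4 prototype records
rec = reconstruct_from_memory(memory[0], memory, k=1)
```

The reconstruction error between the input and its memory-based reconstruction can then serve as the anomaly score.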
In this paper, we seek to tackle a challenge in training low-precision networks: the notorious difficulty in propagating gradient through a low-precision network due to the non-differentiable quantization function.
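The standard workaround for this difficulty, widely used in the quantization literature, is the straight-through estimator: the forward pass applies the non-differentiable quantizer, while the backward pass treats it as (clipped) identity. A minimal NumPy illustration with a binarizer, using hypothetical function names:

```python
import numpy as np

def quantize_forward(x):
    """Non-differentiable forward pass: binarize to {-1, +1}."""
    return np.where(x >= 0, 1.0, -1.0)

def quantize_backward(x, grad_out):
    """Straight-through estimator: pass the incoming gradient through
    unchanged where |x| <= 1, and zero it outside (a hard-tanh surrogate
    for the true, almost-everywhere-zero derivative)."""
    return grad_out * (np.abs(x) <= 1.0)

x = np.array([-2.0, -0.5, 0.3, 1.5])
y = quantize_forward(x)                     # [-1, -1, 1, 1]
g = quantize_backward(x, np.ones_like(x))   # [0, 1, 1, 0]
```

Without such a surrogate, the true gradient of the step function is zero almost everywhere, so no learning signal would reach the weights.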
A sketch-based 3D shape retrieval
The main challenge of this problem comes from the large scale and fine-grained nature of the product categories, as well as the difficulty of collecting training images that reflect realistic checkout scenarios due to the continuous update of products.
As a post-processing step for object detection, non-maximum suppression (GreedyNMS) has been widely used in most detectors for many years.
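The greedy procedure referred to here is standard: repeatedly keep the highest-scoring box and discard all remaining boxes that overlap it beyond an IoU threshold. A self-contained NumPy sketch:

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Classic greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,).
    Returns the indices of kept boxes, highest score first.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the top box with each remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) \
               * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Suppress boxes that overlap the kept box too much.
        order = rest[iou <= iou_thresh]
    return keep
```

For example, two heavily overlapping detections of the same object collapse to the single higher-scoring one, while a distant detection survives.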
Our rationale is that the mask prediction could be better modeled as a binary segmentation problem and the difficulty of estimating the density could be reduced if the mask is known.
Inspired by the coarse-to-fine hierarchical process, we propose an end-to-end RNN-based Hierarchical Attention (RNN-HA) classification model for vehicle re-identification.
In this paper, we propose to train convolutional neural networks (CNNs) with both binarized weights and activations, leading to quantized models specifically for mobile devices with limited power capacity and computation resources.
Towards this goal, we present a simple but effective two-branch network to simultaneously map semantic descriptions and visual samples into a joint space, on which visual embeddings are forced to regress to their class-level semantic embeddings, and embeddings across classes are required to be distinguishable by a trainable classifier.
In this study, we revisit this problem from an orthogonal view, and propose a novel learning strategy to maximize the pixel-wise fitting capacity of a given lightweight network architecture.
To solve this problem, we propose an end-to-end trainable deep network which is inspired by the state-of-the-art fine-grained recognition model and is tailored for the FSFG task.
This paper tackles the problem of training a deep convolutional neural network with both low-precision weights and low-bitwidth activations.
In contrast, human vision is able to predict poses by exploiting geometric constraints of landmark point inter-connectivity.
The proposed method still builds one classifier for one interaction (as per type (ii) above), but the classifier built is adaptive to context via weights which are context dependent.
To overcome this visual-semantic discrepancy, this work proposes an objective function to re-align the distributed word embeddings with visual information by learning a neural network to map them into a new representation called visually aligned word embedding (VAWE).
One-shot learning is a challenging problem where the aim is to recognize a class identified by a single training image.
Training a Fully Convolutional Network (FCN) for semantic segmentation requires a large number of masks with pixel-level labelling, which involves a large amount of human labour and time for annotation.
In contrast, human vision is able to predict poses by exploiting geometric constraints of joint inter-connectivity.
The success of deep learning techniques in the computer vision domain has triggered a range of initial investigations into their utility for visual place recognition, all using generic features from networks that were trained for other types of recognition tasks.
The critical observation underpinning our approach is thus that learning the motion flow instead allows the model to focus on the cause of the blur, irrespective of the image content.
In this work, we propose to model the relational information between people as a sequence prediction task.
This paper proposes to improve visual question answering (VQA) with structured representations of both scene contents and questions.
Fine-grained classification is a relatively new field that has concentrated on using information from a single image, while ignoring the enormous potential of using video data to improve classification.
Instance retrieval requires one to search for images that contain a particular object within a large corpus.
The key observation motivating our approach is that "regular object" images, "unusual object" images and "other objects" images exhibit different region-level scores in terms of both the score values and the spatial distributions.
Classifying a visual concept merely from its associated online textual source, such as a Wikipedia article, is an attractive research topic in zero-shot learning because it alleviates the burden of manually collecting semantic attributes.
To address this problem, we propose a novel approach by inspecting the distribution of the detection scores at multiple image regions based on the detector trained from the "regular object" and "other objects".
Most video based action recognition approaches create the video-level representation by temporally pooling the features extracted at each frame.
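Temporal pooling, as described here, collapses the time axis of per-frame features into a single video-level vector, discarding frame order. A minimal sketch with hypothetical feature dimensions:

```python
import numpy as np

# Per-frame features: T frames, each a D-dimensional vector
# (e.g. extracted by a CNN); T=16 and D=512 are illustrative.
frame_feats = np.random.rand(16, 512)

# Average pooling and max pooling over the temporal axis both yield
# a fixed-size video-level representation regardless of T.
avg_pooled = frame_feats.mean(axis=0)  # shape (512,)
max_pooled = frame_feats.max(axis=0)   # shape (512,)
```

Because the pooled vector has a fixed size for any number of frames, it can be fed to a standard classifier, at the cost of losing temporal ordering.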
To handle this limitation, in this paper we break the convention that assumes a local feature is drawn from one of a few Gaussian distributions.
Most of these studies adopt activations from a single DCNN layer, usually the fully-connected layer, as the image representation.
This brings two general discriminative learning frameworks for Gaussian Bayesian networks (GBN).
The purpose of mid-level visual element discovery is to find clusters of image patches that are both representative and discriminative.
Much of the recent progress in Vision-to-Language (V2L) problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
The introduction of low-cost RGB-D sensors has promoted the research in skeleton-based human action recognition.
One challenge is that video contains a varying number of frames, which is incompatible with the standard input format of CNNs.
This paper, however, advocates that, if used appropriately, convolutional layer activations can be turned into a powerful image representation which enjoys many advantages over fully-connected layer activations.
We apply our approach to scene and object classification tasks, and demonstrate that our approach outperforms all previous works on mid-level visual element discovery by a sizeable margin with far fewer elements being used.
By calculating the gradient vector of the proposed model, we derive a new Fisher vector encoding strategy, termed Sparse Coding based Fisher Vector Coding (SCFVC).
In the third criterion, which shows the best merging performance, we propose a max-margin-based parameter estimation method and apply it with a multinomial distribution.
Analyzing brain networks from neuroimages is becoming a promising approach in identifying novel connectivity-based biomarkers for the Alzheimer's disease (AD).