In this paper, we rethink the implicit reasoning process in VQA and propose a new formulation that maximizes the log-likelihood of the joint distribution of the observed question and the predicted answer.
Language bias is a critical issue in Visual Question Answering (VQA), where models often exploit dataset biases for the final decision without considering the image information.
Ranked #2 on Visual Question Answering on VQA-CP
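One way to read the joint log-likelihood mentioned in the VQA entry above is through the chain rule. The sketch below is an assumption rather than the paper's stated objective: the symbols $v$ (image), $q$ (question), and $a$ (answer), and the conditioning on $v$, are introduced here for illustration only.

```latex
\begin{align}
\log p(q, a \mid v) &= \log p(a \mid q, v) + \log p(q \mid v)
\end{align}
```

Under this reading, the first term is the usual answer-prediction likelihood, while the second term requires the question itself to be explained by the image, which is one plausible way to discourage answers driven by language priors alone.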
To date, learning weakly supervised panoptic segmentation (WSPS) with only image-level labels remains unexplored.
Weakly supervised instance segmentation (WSIS) with only image-level labels has recently drawn much attention.
In the spirit of NAS, we propose to search for an efficient network architecture (NPPNet) to tackle two tasks at the same time.
Though remarkable progress has been achieved, we observe that the closer a pixel is to the edge, the more difficult it is to predict, because edge pixels have a highly imbalanced distribution.
Ranked #1 on Salient Object Detection on DUTS-TE (MAE metric)
Compared with the existing practice of feature concatenation, we find that uncovering the correlation among the three factors is a superior way of leveraging the pivotal contextual cues provided by edges and poses.
On the discriminator, GVB helps enhance the discriminating ability and balance the adversarial training process.
In particular, the advantage of CHR is more significant in scenarios with fewer positive training samples, which demonstrates its potential for real-world security inspection.
We consider spatial contexts, for which we solve so-called jigsaw puzzles, i.e., each image is cut into a grid of patches that are then shuffled, and the goal is to recover the correct configuration.
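The jigsaw setup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`split_tiles`, `assemble`, `make_jigsaw`, `restore`), the use of NumPy, and the requirement that the image size be divisible by the grid size are all hypothetical choices made for this example.

```python
import numpy as np


def split_tiles(image, grid):
    """Cut an H x W (x C) array into grid*grid tiles, in row-major order."""
    h, w = image.shape[:2]
    # Assumption for this sketch: the image divides evenly into the grid.
    assert h % grid == 0 and w % grid == 0, "image must divide evenly"
    th, tw = h // grid, w // grid
    return [
        image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
        for r in range(grid)
        for c in range(grid)
    ]


def assemble(tiles, grid):
    """Stitch a row-major list of grid*grid tiles back into one array."""
    rows = [
        np.concatenate(tiles[r * grid:(r + 1) * grid], axis=1)
        for r in range(grid)
    ]
    return np.concatenate(rows, axis=0)


def make_jigsaw(image, grid, rng):
    """Shuffle the tiles; position i of the result holds original tile perm[i]."""
    tiles = split_tiles(image, grid)
    perm = rng.permutation(grid * grid)
    return assemble([tiles[i] for i in perm], grid), perm


def restore(shuffled, perm, grid):
    """Undo the shuffle given the permutation (the target a model would predict)."""
    tiles = split_tiles(shuffled, grid)
    inverse = np.argsort(perm)  # inverse[k] is the position holding original tile k
    return assemble([tiles[i] for i in inverse], grid)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.arange(36).reshape(6, 6)
    shuffled, perm = make_jigsaw(image, 3, rng)
    print(np.array_equal(restore(shuffled, perm, 3), image))
```

In a self-supervised setting, the network would see only `shuffled` and be trained to predict `perm` (or an equivalent encoding of it); `restore` is included here only to check that the permutation fully determines the original layout.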
Optimizing a deep neural network is a fundamental task in computer vision, yet direct training methods often suffer from over-fitting.
Computer vision is difficult, partly because the desired mathematical function connecting input and output data is often complex, fuzzy and thus hard to learn.
First, a novel cost-sensitive multi-task loss function is designed to learn transferable aging features by training on the source population.
Our deep architecture explicitly leverages human part cues to alleviate pose variations and learn robust feature representations from both the global image and different local parts.
Ranked #77 on Person Re-Identification on Market-1501
We also propose a semi-supervised attribute learning framework that progressively boosts the accuracy of attributes using only a limited amount of labeled data.
Since attributes are generally correlated, we introduce a low-rank attribute embedding into the MTL formulation to embed the original binary attributes into a continuous attribute space, where incorrect and incomplete attributes are rectified and recovered to better describe people.