We focus on the confounding bias between language and location in the visual grounding pipeline, which we find to be the major bottleneck in visual reasoning.
Class-Incremental Learning (CIL) trains classifiers under a strict memory budget: in each incremental phase, the model learns from new data, most of which is then discarded to free space for the next phase.
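As a minimal sketch of this memory constraint: assuming a fixed exemplar budget and uniform random exemplar selection (the selection strategy is an assumption; herding or other criteria are also common), the phase-by-phase bookkeeping might look like:

```python
import random

class ExemplarMemory:
    """Fixed-budget memory for class-incremental learning (illustrative).

    After each phase, at most `budget` samples survive; the rest of the
    phase's data is discarded to free space for the next phase.
    """

    def __init__(self, budget):
        self.budget = budget
        self.exemplars = []  # surviving (image, label) pairs

    def update(self, new_data):
        # Data available in the current phase: old exemplars + new data.
        phase_data = self.exemplars + new_data
        # Enforce the memory budget; here we subsample uniformly at random,
        # though herding-based selection is common in practice.
        if len(phase_data) > self.budget:
            phase_data = random.sample(phase_data, self.budget)
        self.exemplars = phase_data
        return phase_data
```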
A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics).
To implement this module, we define two variants of attention: self-attention on the summed-up feature map, and cross-attention between the two feature maps before they are summed.
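A minimal PyTorch-style sketch of the two variants, assuming both feature maps are already token sequences with a shared channel dimension (the dimensions and head count are illustrative, not from the paper):

```python
import torch.nn as nn

d_model = 256
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

def self_attention_on_sum(fa, fb):
    # Variant 1: sum the two feature maps, then self-attend.
    f = fa + fb                # (B, N, d_model)
    out, _ = attn(f, f, f)     # queries = keys = values = summed map
    return out

def cross_attention_then_sum(fa, fb):
    # Variant 2: cross-attend between the two maps, then sum.
    a2b, _ = attn(fa, fb, fb)  # fa queries attend to fb
    b2a, _ = attn(fb, fa, fa)  # fb queries attend to fa
    return a2b + b2a
```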
The attention module does not always help deep models learn causal features that are robust in any confounding context, e.g., a foreground object feature that is invariant to different backgrounds.
Pre-trained multilingual language models, e.g., multilingual BERT, are widely used in cross-lingual tasks, yielding state-of-the-art performance.
However, the theoretical solution provided by transportability is far from practical for UDA, because it requires the stratification and representation of the unobserved confounder that is the cause of the domain gap.
Existing food image segmentation models underperform for two reasons: (1) there is a lack of high-quality food image datasets with fine-grained ingredient labels and pixel-wise location masks -- the existing datasets either carry coarse ingredient labels or are small in size; and (2) the complex appearance of food makes it difficult to localize and recognize ingredients in food images, e.g., ingredients may overlap one another in the same image, and the same ingredient may appear differently across food images.
We show that the key reason is that the generation is not Counterfactual Faithful, and thus we propose a faithful alternative whose generation answers the sample-specific counterfactual question: what would the sample look like if we set its class attribute to a certain class while keeping its sample attribute unchanged?
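In notation of our own choosing (not the paper's): writing $z_s(x)$ for the sample attribute and $z_c$ for the class attribute of a sample $x$, the faithful generation answers this counterfactual by intervening only on the class attribute, $$\tilde{x}_c = G\bigl(z_s(x),\, z_c = c\bigr),$$ so the generated $\tilde{x}_c$ keeps the sample attribute of $x$ while taking on class $c$.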
We then apply two novel CVC inference methods (on trained models) to capture the effect of comprehensive reasoning as the final prediction.
Class-Incremental Learning (CIL) aims to learn a classification model with the number of classes increasing phase-by-phase.
Specifically, we develop three effective IFSL algorithmic implementations based on the backdoor adjustment, which is essentially a causal intervention on the SCM of many-shot learning: the upper bound of FSL from a causal view.
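For reference, the backdoor adjustment takes the standard form $$P(Y \mid \mathrm{do}(X)) = \sum_{d} P(Y \mid X, D = d)\, P(D = d),$$ where $D$ ranges over the strata of the confounder (in IFSL, the pre-trained knowledge); the three implementations can be read as different ways of stratifying or approximating $D$.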
We present a causal inference framework to improve Weakly-Supervised Semantic Segmentation (WSSS).
Yet, the non-local spatial interactions do not operate across scales, and thus they fail to capture the non-local contexts of objects (or parts) residing at different scales.
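A hedged sketch of what a cross-scale non-local interaction could look like (names and shapes are ours, not the paper's): queries from one scale attend to keys and values from another, so context is aggregated across scales rather than only within one.

```python
import torch.nn as nn

class CrossScaleNonLocal(nn.Module):
    """Illustrative cross-scale non-local block: fine-scale tokens
    attend to coarse-scale tokens (flattened feature maps)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_fine, feat_coarse):
        # feat_fine:   (B, N_fine, dim)   tokens from the fine scale
        # feat_coarse: (B, N_coarse, dim) tokens from the coarse scale
        out, _ = self.attn(feat_fine, feat_coarse, feat_coarse)
        return feat_fine + out  # residual, as in standard non-local blocks
```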
We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA.
However, there is an inherent trade-off to effectively learning new concepts without catastrophic forgetting of previous ones.
In this paper, we propose a novel approach called meta-transfer learning (MTL) which learns to transfer the weights of a deep NN for few-shot learning tasks.
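The abstract does not spell out the transfer mechanism here; in MTL's published form, the pretrained feature extractor is frozen and only lightweight per-channel scaling and shifting parameters are meta-learned, roughly as below (a simplification that ignores groups and dilation):

```python
import torch
import torch.nn as nn

class ScaleShiftConv(nn.Module):
    """Sketch of transferring a pretrained conv layer via learned
    per-channel scaling and shifting, with the base weights frozen."""

    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():
            p.requires_grad = False  # freeze pretrained weights
        out_ch = conv.out_channels
        self.scale = nn.Parameter(torch.ones(out_ch, 1, 1, 1))
        self.shift = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):
        w = self.conv.weight * self.scale  # learned scaling
        return nn.functional.conv2d(
            x, w, self.shift,              # learned shift as bias
            stride=self.conv.stride,
            padding=self.conv.padding,
        )
```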
In this paper, we alleviate the missing-annotation problem and enable the joint reasoning by leveraging the language scene graph which covers both labeled referent and unlabeled contexts (other objects, attributes, and relationships).
On each task, we train a few-shot model to predict pseudo labels for unlabeled data, and then iterate the self-training steps on labeled and pseudo-labeled data with each step followed by fine-tuning.
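Schematically, with all helper callables supplied by the caller (placeholders, not an API from the paper):

```python
def self_training(model, labeled, unlabeled, steps,
                  train, predict_pseudo_labels, fine_tune):
    """Iterative self-training sketch following the loop described above."""
    model = train(model, labeled)  # initial few-shot model
    for _ in range(steps):
        # Pseudo-label the unlabeled pool with the current model.
        pseudo = predict_pseudo_labels(model, unlabeled)
        # Self-training step on labeled + pseudo-labeled data...
        model = train(model, labeled + pseudo)
        # ...with each step followed by fine-tuning.
        model = fine_tune(model, labeled)
    return model
```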
Image-to-image (I2I) translation is a pixel-level mapping that requires a large amount of paired training data and often suffers from high diversity and strong category bias in image scenes.
"Empirical" means that the hyperparameters, e. g., used for learning and ensembling the epoch-wise models, are generated by hyperprior learners conditional on task-specific data.
We propose a novel approach to jointly perform 3D shape retrieval and pose estimation from monocular images. In order to make the method robust to real-world image variations, e.g., complex textures and backgrounds, we learn an embedding space from 3D data that includes only the relevant information, namely the shape and pose.
In this paper, we propose a novel few-shot learning method called meta-transfer learning (MTL), which learns to adapt a deep NN to few-shot learning tasks.
As more and more personal photos are shared and tagged in social media, avoiding privacy risks such as unintended recognition becomes increasingly challenging.
Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information.
As more and more personal photos are shared online, being able to obfuscate identities in such photos is becoming a necessity for privacy protection.
This paper proposes the novel Pose Guided Person Generation Network (PG$^2$), which synthesizes person images in arbitrary poses based on an image of that person and a novel pose.
Experimental results on three public datasets and two proposed datasets demonstrate the superiority of the proposed approach, indicating the effectiveness of body structure and orientation information for improving re-identification performance.