The depth estimation branch is trained on RGB-D images and then used to estimate pseudo depth maps for all unlabeled RGB images, forming the paired data.
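Concretely, this amounts to a two-stage pipeline: supervise the depth branch on RGB-D pairs, then run it in inference mode over the unlabeled RGB pool. Below is a minimal PyTorch sketch of that pipeline under stated assumptions; `DepthBranch`, the loaders, and the L1 regression loss are illustrative stand-ins, not the paper's actual components.

```python
import torch
import torch.nn as nn

class DepthBranch(nn.Module):
    """Toy depth estimator: RGB (B, 3, H, W) -> depth (B, 1, H, W)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, rgb):
        return self.net(rgb)

def train_depth_branch(model, rgbd_loader, epochs=10, lr=1e-3):
    """Stage 1: supervised training on (RGB, depth) pairs with an L1 loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for rgb, depth in rgbd_loader:
            loss = nn.functional.l1_loss(model(rgb), depth)
            opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def make_pseudo_pairs(model, unlabeled_rgb):
    """Stage 2: pair each unlabeled RGB image (3, H, W) with its pseudo depth."""
    model.eval()
    return [(rgb, model(rgb.unsqueeze(0)).squeeze(0)) for rgb in unlabeled_rgb]
```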
Such a setting can help explain the decisions of captioning models and prevent the model from hallucinating object words in its descriptions.
Contemporary visual captioning models frequently hallucinate objects that are not actually in the scene, due to visual misclassification or over-reliance on priors, resulting in semantic inconsistency between the visual information and the target lexical words.
The task of visual question generation (VQG) aims to generate human-like questions from an image and potentially other side information (e.g., answer type or the answer itself).
Existing Class Incremental Learning (CIL) methods are based on a supervised classification framework that is sensitive to data labels.
This paper investigates the feasibility of federated representation learning under the constraints of communication cost and privacy protection.
Furthermore, to obtain a more accurate main direction for Eigen-Reptile in the presence of label noise, we propose Introspective Self-paced Learning (ISPL).
There is a consensus that small models perform quite poorly under the paradigm of self-supervised contrastive learning.
Second, we introduce semantic coherence learning to explicitly encourage the semantic coherence of the adaptive hierarchical graph network across three hierarchies.
After investigating existing strategies, we observe that there is a lack of study on how to prevent inter-phase confusion.
Efforts to reduce noise in training data generated by distant supervision (DS) began when DS was first introduced into the relation extraction (RE) task.
In this paper, we propose collaborative adversarial training to improve data utilization, coordinating virtual adversarial training (VAT) and adversarial training (AT) at different levels.
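As a rough illustration of how AT and VAT can be coordinated, the sketch below applies a label-based FGSM adversarial loss on labeled batches and a label-free virtual adversarial loss on unlabeled batches. The FGSM step, the single power-iteration-style perturbation, the image-batch shape (B, C, H, W), and the weight `alpha` are all assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def at_loss(model, x, y, eps=0.03):
    """Adversarial training loss on labeled data (FGSM perturbation)."""
    x = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(x), y), x)
    x_adv = (x + eps * grad.sign()).detach()
    return F.cross_entropy(model(x_adv), y)

def vat_loss(model, x, eps=0.03, xi=1e-6):
    """Virtual adversarial loss on unlabeled data: KL(p(x) || p(x + r_adv))."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)
    # Random direction, normalized per sample (assumes 4D image batches).
    d = torch.randn_like(x)
    d = (xi * d / d.flatten(1).norm(dim=1).view(-1, 1, 1, 1)).requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + d), dim=1), p, reduction="batchmean")
    grad, = torch.autograd.grad(kl, d)
    r_adv = eps * grad / grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
    return F.kl_div(F.log_softmax(model(x + r_adv.detach()), dim=1), p,
                    reduction="batchmean")

def collaborative_loss(model, x_l, y_l, x_u, alpha=1.0):
    """One possible coordination: AT on labeled data, VAT on unlabeled data."""
    return at_loss(model, x_l, y_l) + alpha * vat_loss(model, x_u)
```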
Recently emerged weakly supervised object localization (WSOL) methods can learn to localize an object in an image using only image-level labels.
In this paper, we propose a novel graph learning framework for phrase grounding in images.
Recently, a newly proposed self-supervised framework, Bootstrap Your Own Latent (BYOL), has seriously challenged the necessity of negative samples in contrastive learning frameworks.
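For context, the sketch below shows why BYOL needs no negatives: an online network plus a predictor is trained to match the output of a slowly moving EMA target network on another augmented view, with a stop-gradient on the target. The toy encoder interface (images to `dim`-dimensional vectors) and the hyperparameters are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class BYOLSketch(nn.Module):
    def __init__(self, encoder, dim=128, tau=0.996):
        super().__init__()
        self.online = encoder                      # trained by gradient descent
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))
        self.target = copy.deepcopy(encoder)       # EMA copy, never backproped
        for p in self.target.parameters():
            p.requires_grad = False
        self.tau = tau

    def loss(self, v1, v2):
        """Symmetric cosine loss between predictions and stop-grad targets."""
        p1 = self.predictor(self.online(v1))
        p2 = self.predictor(self.online(v2))
        with torch.no_grad():
            z1, z2 = self.target(v1), self.target(v2)
        def d(p, z):
            return 2 - 2 * F.cosine_similarity(p, z, dim=-1).mean()
        return d(p1, z2) + d(p2, z1)               # no negative pairs anywhere

    @torch.no_grad()
    def update_target(self):
        """Exponential moving average update of the target network."""
        for po, pt in zip(self.online.parameters(), self.target.parameters()):
            pt.mul_(self.tau).add_((1 - self.tau) * po)
```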
In particular, we first cast the meta-overfitting problem (overfitting on sampling and label noise) as a gradient noise problem, since the few available samples cause the meta-learner to overfit on the existing examples (clean or corrupted) of an individual task at every gradient step.
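One hedged way to make this concrete: treat the per-step updates as noisy observations of a shared signal and recover a "main direction" as the leading singular vector, as sketched below. This is an illustrative stand-in for the idea of filtering gradient noise, not the exact Eigen-Reptile computation.

```python
import torch

def main_direction(updates):
    """Leading singular vector of T flattened update vectors, each shape (D,)."""
    U = torch.stack(updates)                          # (T, D)
    mean_u = U.mean(dim=0)
    _, _, Vh = torch.linalg.svd(U - mean_u, full_matrices=False)
    v = Vh[0]                                         # direction of max shared variance
    return v if torch.dot(v, mean_u) >= 0 else -v    # orient along the mean update

# Synthetic check: noisy updates around a shared direction; v filters the noise.
true_dir = torch.randn(64); true_dir /= true_dir.norm()
updates = [0.1 * true_dir + 0.02 * torch.randn(64) for _ in range(8)]
v = main_direction(updates)
print(torch.dot(v, true_dir).abs())   # close to 1 when the noise is filtered
```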
To this end, we learn a differentiable graph neural network as a surrogate model to rank candidate architectures, which enables us to obtain gradients w.r.t. the input architectures.
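The sketch below illustrates the surrogate idea at toy scale: a small message-passing network scores a relaxed (soft) architecture encoding, and gradient ascent on that encoding steers the search. The 4-node DAG, the 5-operation set, and the untrained surrogate weights are all assumptions for illustration; in practice the surrogate would first be fitted to observed architecture-accuracy pairs.

```python
import torch
import torch.nn as nn

class SurrogateGNN(nn.Module):
    """Scores a DAG given an adjacency matrix and soft per-node operation features."""
    def __init__(self, n_ops, hidden=32):
        super().__init__()
        self.embed = nn.Linear(n_ops, hidden)
        self.mix = nn.Linear(hidden, hidden)
        self.readout = nn.Linear(hidden, 1)

    def forward(self, adj, ops):
        h = torch.relu(self.embed(ops))          # (N, hidden) node embeddings
        h = torch.relu(self.mix(adj @ h))        # one round of message passing
        return self.readout(h.mean(dim=0))       # scalar predicted score

adj = torch.triu(torch.ones(4, 4), diagonal=1)   # toy upper-triangular DAG
ops = torch.randn(4, 5, requires_grad=True)      # relaxed choice of 5 ops per node
surrogate = SurrogateGNN(n_ops=5)
for p in surrogate.parameters():                 # the surrogate stays fixed here
    p.requires_grad_(False)

opt = torch.optim.Adam([ops], lr=0.1)
for _ in range(50):                              # gradient ascent on the encoding
    score = surrogate(adj, ops)
    opt.zero_grad(); (-score).backward(); opt.step()
best_ops = ops.argmax(dim=1)                     # discretize the relaxed encoding
```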
Although current mainstream methods have begun to combine SSL and AL (SSL-AL) to exploit the diverse expressions of unlabeled samples, the fully supervised task models in these methods are still trained only with labeled data.
In this paper, we investigate the problem of text-to-pedestrian synthesis, which has many potential applications in art, design, and video surveillance.
When patients need to take medicine, particularly more than one kind of drug simultaneously, they should be alerted to possible drug-drug interactions.
In this paper, we focus on enhancing the generalization ability of the VIST model by considering the few-shot setting.
Measuring the scholarly impact of a document without citations is an important and challenging problem.
Infectious keratitis is the most common entity among corneal diseases, in which a pathogen grows in the cornea, leading to inflammation and destruction of the corneal tissue.
This challenge is based on newly collected validation and testing image datasets and is hence named SIDD+.
To address this challenge, we present a new dataset, called Quda, that aims to help V-NLIs recognize analytic tasks from free-form natural language by training and evaluating cutting-edge multi-label classification models.
It can attack text classification models with a higher success rate than existing methods while maintaining acceptable quality for human readers.
Existing image completion procedures are highly subjective, as they consider only visual context, which may produce unpredictable results that are plausible but not faithful to grounded knowledge.
First, we propose a novel Orientation-Aware feature extraction and fusion Module (OAM), which contains a mixture of 1D and 2D convolutional kernels (i.e., 5x1, 1x5, and 3x3) for extracting orientation-aware features.
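A minimal sketch of such a kernel mixture is shown below: parallel 5x1, 1x5, and 3x3 convolutions whose outputs are fused by summation. The sum-fusion rule and the channel counts are assumptions; the actual OAM may combine the branches differently.

```python
import torch
import torch.nn as nn

class OAMSketch(nn.Module):
    """Parallel 1D (oriented) and 2D (isotropic) convolution branches."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv5x1 = nn.Conv2d(c_in, c_out, (5, 1), padding=(2, 0))  # vertical
        self.conv1x5 = nn.Conv2d(c_in, c_out, (1, 5), padding=(0, 2))  # horizontal
        self.conv3x3 = nn.Conv2d(c_in, c_out, 3, padding=1)            # isotropic
        self.act = nn.ReLU()

    def forward(self, x):
        # Each branch preserves spatial size; sum-fusion mixes orientation cues.
        return self.act(self.conv5x1(x) + self.conv1x5(x) + self.conv3x3(x))

feats = OAMSketch(3, 16)(torch.randn(1, 3, 64, 64))   # -> (1, 16, 64, 64)
```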
Visual navigation is the task of training an embodied agent to intelligently navigate to a target object (e.g., a television) using only visual observations.
Despite the recent success of collective entity linking (EL) methods, these "global" inference methods may yield sub-optimal results when the "all-mention coherence" assumption breaks, and they often suffer from high computational cost at the inference stage due to the complex search space.
The experimental results and further analysis show that the agent with the MIND module is superior to its counterparts not only in EQA performance but also in many other aspects, such as route planning, behavioral interpretation, and the ability to generalize from a few examples.
To solve this problem, we propose a method that mines cross-modal rules to help the model infer these informative concepts given certain visual input.
Fine-grained Entity Typing is a challenging task that suffers from noisy samples extracted via distant supervision.
Recently, distant supervision has achieved great success on Fine-grained Entity Typing (FET).
Distant supervision leverages knowledge bases to automatically label instances, thus allowing us to train relation extractors without human annotations.
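A toy illustration of this labeling scheme, with a made-up knowledge base: any sentence mentioning a KB entity pair inherits that pair's relation label. The example also shows where DS noise comes from, since matching sentences need not express the relation.

```python
# Made-up knowledge base of (head, tail) -> relation facts.
KB = {("Paris", "France"): "capital_of",
      ("Berlin", "Germany"): "capital_of"}

def ds_label(sentences):
    """Automatically label sentences by string-matching KB entity pairs."""
    labeled = []
    for sent in sentences:
        for (head, tail), relation in KB.items():
            if head in sent and tail in sent:
                labeled.append((sent, head, tail, relation))
    return labeled

# The second sentence is labeled "capital_of" even though it never states that
# relation -- exactly the kind of label noise distant supervision introduces.
print(ds_label(["Paris is the capital of France.",
                "Paris, a city in France, hosts the Olympics."]))
```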
In domain-specific NER, deep models usually fail to perform well due to insufficient labeled training data.