About

Benchmarks

No evaluation results yet. Help compare methods by submitting evaluation metrics.

Datasets

Greatest papers with code

Fusion of Detected Objects in Text for Visual Question Answering

IJCNLP 2019 google-research/language

To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language.

QUESTION ANSWERING VISUAL COMMONSENSE REASONING VISUAL QUESTION ANSWERING

From Recognition to Cognition: Visual Commonsense Reasoning

CVPR 2019 rowanz/r2c

While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world.

VISUAL COMMONSENSE REASONING

UNITER: UNiversal Image-TExt Representation Learning

ECCV 2020 ChenRocks/UNITER

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).

LANGUAGE MODELLING QUESTION ANSWERING REFERRING EXPRESSION COMPREHENSION REPRESENTATION LEARNING TEXT MATCHING VISUAL COMMONSENSE REASONING VISUAL ENTAILMENT VISUAL QUESTION ANSWERING
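The conditional-masking idea above can be illustrated with a minimal sketch: unlike joint random masking, only one modality is corrupted at a time while the other stays fully observed. The function name, token representation, and masking scheme here are illustrative assumptions, not UNITER's actual implementation.

```python
import random

def conditional_mask(text_tokens, image_regions, mask_prob=0.15, target="text"):
    """Mask tokens in one modality only, conditioning the prediction task
    on a full (unmasked) view of the other modality."""
    mask = lambda seq: ["[MASK]" if random.random() < mask_prob else x for x in seq]
    if target == "text":
        # masked language modeling: image regions remain fully observed
        return mask(text_tokens), image_regions
    # masked region modeling: text remains fully observed
    return text_tokens, mask(image_regions)
```

With joint masking, both sequences could be corrupted in the same step, so a masked word might have to be predicted from a masked-out region; conditioning on a full observation of the other modality avoids that misalignment.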

Heterogeneous Graph Learning for Visual Commonsense Reasoning

NeurIPS 2019 yuweijiang/HGL-pytorch

Our HGL consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module to interactively refine reasoning paths for semantic agreement.

GRAPH LEARNING VISUAL COMMONSENSE REASONING

Unifying Vision-and-Language Tasks via Text Generation

4 Feb 2021 j-min/VL-T5

On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models.

CONDITIONAL TEXT GENERATION IMAGE CAPTIONING LANGUAGE MODELLING MULTI-TASK LEARNING QUESTION ANSWERING REFERRING EXPRESSION COMPREHENSION VISUAL COMMONSENSE REASONING VISUAL QUESTION ANSWERING

Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

15 Oct 2020 allenai/visual-reasoning-rationalization

Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights.

LANGUAGE MODELLING NATURAL LANGUAGE INFERENCE OBJECT RECOGNITION QUESTION ANSWERING VISUAL COMMONSENSE REASONING VISUAL QUESTION ANSWERING

Connective Cognition Network for Directional Visual Commonsense Reasoning

NeurIPS 2019 AmingWu/CCN

Inspired by this idea, towards VCR, we propose a connective cognition network (CCN) to dynamically reorganize the visual neuron connectivity that is contextualized by the meaning of questions and answers.

VISUAL COMMONSENSE REASONING