Multimodal Reasoning

37 papers with code • 3 benchmarks • 4 datasets

Multimodal reasoning is the task of drawing inferences jointly from inputs in multiple modalities, such as images and text.

Most implemented papers

e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations

virginie-do/e-SNLI-VE 7 Apr 2020

The recently proposed SNLI-VE corpus for recognising visual-textual entailment is a large, real-world dataset for fine-grained multimodal reasoning.

Dual Attention Networks for Multimodal Reasoning and Matching

iammrhelo/pytorch-vqa-dan CVPR 2017

We propose Dual Attention Networks (DANs) which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language.
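The core idea of attending over both modalities can be sketched in a few lines. This is a minimal illustration of soft attention applied to visual regions and words conditioned on a shared memory vector, not the paper's actual DAN architecture; all variable names and the joint-update rule here are illustrative assumptions.

```python
import numpy as np

def attention(features, memory):
    """Soft attention: weight feature vectors by similarity to a memory vector."""
    scores = features @ memory                # one score per region/word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over regions/words
    return weights @ features                 # weighted context vector

# Hypothetical toy inputs: 3 image regions and 4 words, embedding dim 5
rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(3, 5))
text_feats = rng.normal(size=(4, 5))

# One reasoning step: each modality attends, conditioned on a shared memory
memory = np.zeros(5)
v_ctx = attention(visual_feats, memory)
t_ctx = attention(text_feats, memory)
memory = memory + v_ctx * t_ctx  # illustrative joint update, not the paper's
```

In the actual model the two attention streams are applied repeatedly, refining the shared memory at each step so that vision and language can condition on each other.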

WebQA: Multihop and Multimodal QA

WebQnA/WebQA_Baseline CVPR 2022

Scaling Visual Question Answering (VQA) to the open-domain and multi-hop nature of web searches requires fundamental advances in visual representation learning, knowledge aggregation, and language generation.

Multimodal Analogical Reasoning over Knowledge Graphs

zjunlp/MKG_Analogy 1 Oct 2022

Analogical reasoning is fundamental to human cognition and holds an important place in various fields.

Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models

zoeyyao27/graph-of-thought 26 May 2023

We propose Graph-of-Thought (GoT) reasoning, which models human thought processes not only as a chain but also as a graph.
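The chain-versus-graph distinction can be made concrete with a toy example. Below is a minimal sketch, assuming nothing about the paper's implementation: intermediate "thoughts" become graph nodes, edges encode dependencies, and a topological sort (Kahn's algorithm) recovers a valid reasoning order in which a thought may depend on several predecessors rather than exactly one.

```python
from collections import deque

# Hypothetical thought nodes and dependency edges (t3 needs both t1 and t2)
thoughts = {
    "t1": "extract quantities from the image",
    "t2": "parse the question",
    "t3": "combine t1 and t2 to compute the answer",
}
edges = [("t1", "t3"), ("t2", "t3")]

def topological_order(nodes, edges):
    """Kahn's algorithm: order thoughts so each follows its dependencies."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

print(topological_order(thoughts, edges))  # prints ['t1', 't2', 't3']
```

A plain chain is the special case where every node has at most one predecessor; the graph form lets independent sub-thoughts (here t1 and t2) be merged at a later step.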

Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning

declare-lab/puzzle-reasoning 6 Mar 2024

We present AlgoPuzzleVQA, a new dataset designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that require visual understanding, language understanding, and complex algorithmic reasoning.

PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns

declare-lab/llm-puzzletest 20 Mar 2024

As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of puzzles based on abstract patterns.

DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog

phellonchen/DMRM 18 Dec 2019

Visual Dialog is a vision-language task that requires an AI agent to engage in a conversation with humans grounded in an image.

A Multimodal Framework for the Detection of Hateful Memes

Nithin-Holla/meme_challenge 23 Dec 2020

An increasingly common expression of online hate speech is multimodal in nature and comes in the form of memes.

UniT: Multimodal Multitask Learning with a Unified Transformer

facebookresearch/mmf ICCV 2021

We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning.