Multimodal Reasoning
88 papers with code • 3 benchmarks • 9 datasets
Reasoning over multimodal inputs, i.e., drawing inferences that combine information from multiple modalities such as images, text, and audio.
Most implemented papers
e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations
The recently proposed SNLI-VE corpus for recognising visual-textual entailment is a large, real-world dataset for fine-grained multimodal reasoning.
WebQA: Multihop and Multimodal QA
Scaling Visual Question Answering (VQA) to the open-domain, multi-hop nature of web searches requires fundamental advances in visual representation learning, knowledge aggregation, and language generation.
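A generic multihop retrieve-aggregate-generate loop, sketched here only to illustrate the open-domain setting WebQA targets. The `retrieve`, `propose_query`, and `answer` callables and the stopping rule are hypothetical placeholders, not WebQA's released pipeline.

```python
def multihop_qa(question, retrieve, propose_query, answer, max_hops=3):
    """Answer an open-domain question by gathering evidence over several hops."""
    evidence = []            # mixed text and image snippets gathered across hops
    query = question
    for _ in range(max_hops):
        evidence.extend(retrieve(query))           # each hop pulls web sources
        query = propose_query(question, evidence)  # returns None when evidence suffices
        if query is None:
            break
    return answer(question, evidence)              # generate a free-form answer
```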
Dual Attention Networks for Multimodal Reasoning and Matching
We propose Dual Attention Networks (DANs) which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language.
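A minimal sketch of one dual-attention step in this spirit: a shared memory vector attends over visual region features and textual token features in parallel, and the two attended contexts update the memory. The layer names, dimensions, and multiplicative fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionStep(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.v_proj = nn.Linear(dim, dim)
        self.t_proj = nn.Linear(dim, dim)
        self.v_score = nn.Linear(dim, 1)   # scores visual regions
        self.t_score = nn.Linear(dim, 1)   # scores text tokens

    def forward(self, memory, visual, text):
        # memory: (B, D); visual: (B, R, D) region features; text: (B, T, D) token features
        v_att = F.softmax(self.v_score(torch.tanh(self.v_proj(visual) + memory.unsqueeze(1))), dim=1)
        t_att = F.softmax(self.t_score(torch.tanh(self.t_proj(text) + memory.unsqueeze(1))), dim=1)
        v_ctx = (v_att * visual).sum(dim=1)   # attended visual context
        t_ctx = (t_att * text).sum(dim=1)     # attended textual context
        return memory + v_ctx * t_ctx         # one simple fusion choice for the update

# Usage: one refinement step over 36 image regions and 20 text tokens.
step = DualAttentionStep(dim=512)
memory = step(torch.zeros(2, 512), torch.randn(2, 36, 512), torch.randn(2, 20, 512))
```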
Multimodal Analogical Reasoning over Knowledge Graphs
Analogical reasoning is fundamental to human cognition and holds an important place in various fields.
Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models
We propose Graph-of-Thought (GoT) reasoning, which models human thought processes not only as a chain but also as a graph.
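A minimal sketch of the underlying data structure: a conclusion can depend on several parallel premises rather than a single predecessor. The string-valued thoughts and the depth-first trace are illustrative assumptions; the paper encodes thought graphs with learned representations.

```python
from collections import defaultdict

class ThoughtGraph:
    def __init__(self):
        self.thoughts = {}                 # node id -> thought text
        self.parents = defaultdict(list)   # node id -> supporting node ids

    def add(self, node_id, text, supports=()):
        self.thoughts[node_id] = text
        self.parents[node_id].extend(supports)

    def trace(self, node_id, depth=0):
        # Walk the premises supporting a conclusion, depth-first.
        print("  " * depth + self.thoughts[node_id])
        for parent in self.parents[node_id]:
            self.trace(parent, depth + 1)

g = ThoughtGraph()
g.add("t1", "The diagram shows 3 red and 2 blue marbles.")
g.add("t2", "The question asks for the probability of drawing red.")
g.add("t3", "P(red) = 3 / (3 + 2) = 0.6", supports=["t1", "t2"])
g.trace("t3")  # a conclusion with two parallel premises, not a single chain
```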
MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks
Our work complements research on the performance of multimodal large language models (MLLMs) in multimodal comprehension tasks, enabling a more comprehensive and holistic evaluation of MLLMs.
Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning
We present AlgoPuzzleVQA, a new dataset designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that require visual understanding, language understanding, and complex algorithmic reasoning.
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns
To diagnose the reasoning challenges in large multimodal models, we progressively guide the models with our ground truth reasoning explanations for visual perception, inductive reasoning, and deductive reasoning.
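A sketch of that progressive-guidance idea: inject ground-truth explanations one stage at a time and observe at which stage accuracy recovers. The `query_model` callable and the field names are hypothetical placeholders, not the released evaluation code.

```python
STAGES = ["perception", "inductive", "deductive"]

def build_prompt(puzzle, explanations, reveal_up_to):
    """Append ground-truth hints for the first `reveal_up_to` reasoning stages."""
    prompt = puzzle["question"]
    for stage in STAGES[:reveal_up_to]:
        prompt += f"\n[{stage} hint] {explanations[stage]}"
    return prompt + "\nAnswer:"

def diagnose(puzzle, explanations, query_model):
    # Correctness per level of guidance: 0 hints, then +perception, +inductive, ...
    results = {}
    for k in range(len(STAGES) + 1):
        answer = query_model(build_prompt(puzzle, explanations, k))
        results[k] = (answer.strip() == puzzle["answer"])
    return results
```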
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music understanding in multimodal language models focused on audio.
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
We employ text-based synthesis techniques to construct chart-plotting code, producing ReachQA, a dataset of 3k reasoning-intensive charts and 20k Q&A pairs that enhances both recognition and reasoning abilities.
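A minimal sketch of the text-based synthesis idea: chart-plotting code exists as text, is executed to render an image, and is paired with a question-answer annotation. The seed chart and Q&A pair below are invented for illustration; the actual ReachQA pipeline uses LLM-written plotting code and reasoning-heavy questions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed

chart_code = """
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.bar(["2021", "2022", "2023"], [12, 19, 7])
ax.set_title("Widget sales (k units)")
fig.savefig(out_path)
"""

def synthesize_example(code, out_path, qa):
    # Execute the plotting code held as text, then pair the image with Q&A.
    exec(code, {"out_path": out_path})
    return {"image": out_path, **qa}

example = synthesize_example(
    chart_code, "chart_0001.png",
    {"question": "Between which consecutive years did sales change the most?",
     "answer": "2022 to 2023 (a drop of 12k units)"},
)
```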