Visual Question Answering (VQA)
764 papers with code • 62 benchmarks • 112 datasets
Visual Question Answering (VQA) is a computer-vision task in which a system answers natural-language questions about an image. The goal is to teach machines to understand an image's content well enough to answer arbitrary questions about it in natural language.
(Image source: visualqa.org)
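As a concrete illustration of the task, here is a minimal inference sketch using the Hugging Face transformers VQA pipeline. The checkpoint name and example image URL are illustrative assumptions, not tied to any paper listed below.

```python
# Minimal VQA inference sketch with the Hugging Face `transformers` pipeline.
# The checkpoint and image URL below are assumptions for illustration only.
from transformers import pipeline
from PIL import Image
import requests

# Load a ViLT model fine-tuned for VQA (assumed checkpoint).
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# Fetch an example image (any RGB image works).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Ask a natural-language question about the image.
preds = vqa(image=image, question="How many cats are in the picture?")
print(preds[0]["answer"], preds[0]["score"])  # top answer with confidence
```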
Latest papers
Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.
Multi-modal Auto-regressive Modeling via Visual Words
Large Language Models (LLMs), which benefit from auto-regressive modeling over massive corpora of unannotated text, demonstrate powerful perceptual and reasoning capabilities.
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review
Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on models designed for medical report generation and visual question answering (VQA).
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA
In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and the limited diversity of visual content hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are used in the ScanQA and SQA datasets).
Uncertainty-Aware Evaluation for Vision-Language Models
Specifically, we show that models with the highest accuracy may also have the highest uncertainty, which confirms the importance of measuring it for VLMs.
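One common way to quantify such uncertainty (a generic sketch, not the specific metric used in the paper above) is the Shannon entropy of the model's predictive distribution over candidate answers:

```python
# Sketch: predictive entropy as an uncertainty score for a VQA/VLM output.
# This is a generic illustration, not the exact protocol of the paper above.
import numpy as np

def predictive_entropy(answer_probs: np.ndarray) -> float:
    """Shannon entropy (in nats) of a distribution over candidate answers.
    Higher entropy means the model is less certain about its answer."""
    p = answer_probs / answer_probs.sum()  # normalize defensively
    p = p[p > 0]                           # avoid log(0)
    return float(-(p * np.log(p)).sum())

# Example: a confident vs. an uncertain answer distribution.
print(predictive_entropy(np.array([0.90, 0.05, 0.05])))  # low entropy
print(predictive_entropy(np.array([0.40, 0.30, 0.30])))  # high entropy
```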
CommVQA: Situating Visual Question Answering in Communicative Contexts
Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation.
CoLLaVO: Crayon Large Language and Vision mOdel
Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision-language (VL) tasks.
Multi-modal preference alignment remedies regression of visual instruction tuning on language model
We propose a distillation-based multi-modal alignment method, using fine-grained annotations on a small dataset, that reconciles the textual and visual performance of MLLMs, restoring and boosting language capability after visual instruction tuning.
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs.