Visual Question Answering (VQA)

764 papers with code • 62 benchmarks • 112 datasets

Visual Question Answering (VQA) is a task in computer vision that involves answering questions about an image. The goal of VQA is to teach machines to understand the content of an image and answer questions about it in natural language.

Image Source: visualqa.org
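As a concrete, minimal illustration (not part of this page's listings), a pretrained VQA model can be queried in a few lines via the Hugging Face `transformers` visual-question-answering pipeline; the checkpoint name, image path, and question below are assumptions, and the exact return format should be checked against the library documentation.

```python
# Minimal VQA sketch: ask a natural-language question about an image using an
# off-the-shelf checkpoint. Checkpoint name and return format are assumptions.
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# The pipeline accepts a local path or URL for the image plus a question string.
predictions = vqa(image="street_scene.jpg", question="How many people are crossing the road?")

# Typically a list of candidate answers with confidence scores, best first.
print(predictions[0]["answer"], predictions[0]["score"])
```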


Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering

FrankZxShen/ATS • 14 Mar 2024 • ★ 3

Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.

Multi-modal Auto-regressive Modeling via Visual Words

pengts/vw-lmm • 12 Mar 2024 • ★ 14

Large Language Models (LLMs), benefiting from auto-regressive modelling over massive unannotated text corpora, demonstrate powerful perceptual and reasoning capabilities.

TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

yuliang-liu/monkey • 7 Mar 2024 • ★ 1,396

We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.

Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review

lab-rasool/awesome-medical-vlms-and-datasets • 4 Mar 2024 • ★ 2

Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on models designed for medical report generation and visual question answering (VQA).

Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA

matthewdm0816/bridgeqa • 24 Feb 2024 • ★ 4

In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and the limited diversity of visual content hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are used in the ScanQA and SQA datasets).

Uncertainty-Aware Evaluation for Vision-Language Models

ensec-ai/vlm-uncertainty-bench • 22 Feb 2024 • ★ 7

Specifically, we show that models with the highest accuracy may also have the highest uncertainty, which confirms the importance of measuring it for VLMs.
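As a generic illustration of what "measuring uncertainty" can mean for a VQA-style model (this is not the benchmark's own protocol), one simple proxy is the entropy of the model's distribution over candidate answers; the candidate scores below are hypothetical placeholders.

```python
# Entropy of a normalized answer distribution as a simple uncertainty proxy.
# Generic illustration only; the scores are hypothetical, not model outputs.
import math

def answer_entropy(scores):
    """Normalize candidate-answer scores into a distribution and return its entropy (nats)."""
    total = sum(scores)
    probs = [s / total for s in scores]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident prediction concentrates mass on one answer and has low entropy...
print(answer_entropy([0.90, 0.05, 0.03, 0.02]))  # ~0.43
# ...while a near-uniform spread over candidates has high entropy (max ln 4 ~ 1.39 here).
print(answer_entropy([0.30, 0.25, 0.25, 0.20]))  # ~1.38
```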

CommVQA: Situating Visual Question Answering in Communicative Contexts

nnaik39/commvqa • 22 Feb 2024 • ★ 1

Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation.

CoLLaVO: Crayon Large Language and Vision mOdel

ByungKwanLee/CoLLaVO • 17 Feb 2024 • ★ 36

Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision-language (VL) tasks.

Multi-modal preference alignment remedies regression of visual instruction tuning on language model

findalexli/mllm-dpo • 16 Feb 2024 • ★ 3

In conclusion, we propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that reconciles the textual and visual performance of MLLMs, restoring and boosting language capability after visual instruction tuning.
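Since the repository name points to DPO, a short sketch of a generic Direct Preference Optimization loss may help situate what "multi-modal preference alignment" optimizes; this is a textbook-style formulation under assumed tensor shapes and temperature, not necessarily the paper's implementation.

```python
# Generic DPO-style preference loss (not this paper's exact code). Each argument
# is a batch of summed log-probabilities of a response under the policy or the
# frozen reference model; beta is an assumed temperature.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of policy vs. reference for preferred and dispreferred responses.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy summed log-probabilities for one preferred/dispreferred pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
```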

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

opengvlab/multi-modality-arena • 14 Feb 2024 • ★ 364

Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs.