Image Captioning
613 papers with code • 32 benchmarks • 64 datasets
Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, in which an input image is encoded into an intermediate representation of its content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.
(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
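As a minimal sketch of the encoder-decoder framework described above, the PyTorch model below encodes an image into a single feature vector with a CNN backbone and decodes it into a token sequence with an LSTM. The ResNet-18 backbone, vocabulary size, and layer widths are illustrative assumptions, not the recipe of any particular paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        # Encoder: a CNN backbone whose pooled features summarize the image.
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier head
        self.project = nn.Linear(512, embed_dim)
        # Decoder: an LSTM that emits the caption one token at a time.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Encode the image into an intermediate representation.
        feats = self.encoder(images).flatten(1)      # (B, 512)
        feats = self.project(feats).unsqueeze(1)     # (B, 1, E)
        # Condition the decoder by prepending the image feature to the
        # embedded ground-truth tokens (teacher forcing during training).
        tokens = self.embed(captions)                # (B, T, E)
        inputs = torch.cat([feats, tokens], dim=1)   # (B, 1+T, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                      # per-step vocabulary logits

model = CaptioningModel(vocab_size=10_000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10_000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```

For evaluation, BLEU can be computed against reference captions with NLTK, as below; CIDEr requires a full reference set and a toolkit such as pycocoevalcap. The example sentences are made up for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "on", "the", "beach"]]
candidate = ["a", "dog", "is", "running", "on", "the", "beach"]
print(sentence_bleu(references, candidate,
                    smoothing_function=SmoothingFunction().method1))
```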
Libraries
Use these libraries to find Image Captioning models and implementations
Datasets
Subtasks
Latest papers
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Diffusion models have exhibited remarkable capabilities in text-to-image generation.
Bridging Vision and Language Spaces with Assignment Prediction
This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world.
ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis
With Large Language Models (LLMs) achieving success in language and commonsense reasoning tasks, we explore the ability of different LLMs to identify and understand key subjects from abstractive captions.
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
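As a hedged sketch of this caption-as-prompt idea: generate a caption first, then pass it to a text-only model together with the question. Both helper functions below are hypothetical placeholders standing in for a captioner and an LLM, not the authors' pipeline.

```python
def generate_caption(image) -> str:
    """Placeholder captioner: any captioning model fits here."""
    return "a man holding a red umbrella in the rain"

def answer_question(prompt: str) -> str:
    """Placeholder for a text-only LLM call."""
    return "red"

def caption_driven_vqa(image, question: str) -> str:
    # Step 1: describe the image in words.
    caption = generate_caption(image)
    # Step 2: ask a text-only model, grounding the question in the caption.
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    return answer_question(prompt)

print(caption_driven_vqa(image=None, question="What color is the umbrella?"))
```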
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
We attribute the misalignment between text prompts and generated images to the diffusion model's insufficient use of its conditioning signal, which is caused by its training paradigm.
Disentangled Pre-training for Human-Object Interaction Detection
We propose DP-HOI, an efficient disentangled pre-training method for human-object interaction (HOI) detection.
Semantic Map-based Generation of Navigation Instructions
In this paper, we propose a new approach to navigation instruction generation by framing the problem as an image captioning task using semantic maps as visual input.
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
To guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce an auxiliary multi-task question-answering objective.
VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning
Built on top of LLMs, vision large language models (VLLMs) have advanced significantly in areas such as recognition, reasoning, and grounding.
Does the Performance of Text-to-Image Retrieval Models Generalize Beyond Captions-as-a-Query?
ConQA comprises 30 descriptive and 50 conceptual queries on 43k images with more than 100 manually annotated images per query.