Image Captioning
622 papers with code • 32 benchmarks • 66 datasets
Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, in which an input image is encoded into an intermediate representation of its content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU and CIDEr.
(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
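To make the encoder-decoder framework described above concrete, here is a minimal sketch, not any particular published model: a CNN encoder maps the image to a feature vector, and an LSTM decoder generates the caption token by token. The ResNet-50 backbone, vocabulary size, and layer dimensions are illustrative assumptions.

```python
# Minimal encoder-decoder captioning sketch (illustrative, not a specific paper's model).
import torch
import torch.nn as nn
import torchvision.models as models


class ImageEncoder(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        # weights=None keeps the sketch offline; in practice a pretrained backbone is used.
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):                       # (B, 3, H, W)
        feats = self.cnn(images).flatten(1)          # (B, 2048)
        return self.fc(feats)                        # (B, embed_dim)


class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # Prepend the image feature as the first "token" of the input sequence.
        tokens = self.embed(captions)                              # (B, T, embed_dim)
        inputs = torch.cat([image_feats.unsqueeze(1), tokens], 1)  # (B, T+1, embed_dim)
        hidden, _ = self.lstm(inputs)                              # (B, T+1, hidden_dim)
        return self.out(hidden)                                    # per-step word logits


# Illustrative forward pass with random data (placeholder sizes).
encoder = ImageEncoder(embed_dim=256)
decoder = CaptionDecoder(vocab_size=10000, embed_dim=256, hidden_dim=512)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 15))
logits = decoder(encoder(images), captions)   # shape (2, 16, 10000)
```

At inference time the decoder is typically run autoregressively (greedy or beam search), and the generated captions are scored against reference captions with BLEU or CIDEr.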
Libraries
Use these libraries to find Image Captioning models and implementations.
Datasets
Subtasks
Latest papers
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
MeaCap: Memory-Augmented Zero-shot Image Captioning
The MeaCap framework achieves state-of-the-art performance on a series of zero-shot IC settings.
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT
Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query.
What Is Missing in Multilingual Visual Reasoning and How to Fix It
NLP models today strive to support multiple languages and modalities, improving accessibility for diverse users.
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
We hypothesize that this is because explicit spatial relations rarely appear in the image captions used to train these models.
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models.
Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning
Secondly, they can serve as additional trajectories in the RL strategy, resulting in a teacher forcing loss weighted by the similarity of the GT to the image.
Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images
Following on recent advances in large language models (LLMs) and subsequent chat models, a new wave of large vision-language models (LVLMs) has emerged.
GPTs Are Multilingual Annotators for Sequence Generation Tasks
However, the conventional approach of data annotation through crowdsourcing is both time-consuming and expensive.
Text-Guided Image Clustering
We therefore propose Text-Guided Image Clustering, i.e., generating text using image captioning and visual question-answering (VQA) models and subsequently clustering the generated text.
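As a rough sketch of the pipeline this excerpt describes (caption the images, then cluster the generated text rather than the pixels), the snippet below captions images with an off-the-shelf model, embeds the captions, and clusters the embeddings. The specific models (BLIP, a MiniLM sentence encoder), the file paths, and the cluster count are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative text-guided image clustering: caption -> embed -> cluster.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Off-the-shelf captioning model (an assumption; the paper may use other models).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(path: str) -> str:
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    ids = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(ids[0], skip_special_tokens=True)

image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]   # placeholder paths
captions = [caption(p) for p in image_paths]

# Embed the generated captions and cluster in text space instead of pixel space.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(captions)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(list(zip(image_paths, labels)))
```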