Image Captioning

613 papers with code • 32 benchmarks • 64 datasets

Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU and CIDEr.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
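
The encoder-decoder pattern described above can be sketched in a few lines of PyTorch: a small convolutional encoder produces an image feature that seeds an autoregressive LSTM decoder trained with teacher forcing. The layer sizes, vocabulary, and toy inputs are illustrative assumptions, not any particular published model; in practice the encoder is usually a pretrained CNN or vision transformer and the decoder is often a Transformer, but the encode-then-decode flow is the same.

```python
# Minimal sketch of the encoder-decoder captioning framework (illustrative sizes only).
import torch
import torch.nn as nn


class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: maps an image to a single feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Decoder: autoregressive LSTM over word embeddings.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Prepend the image feature as the first "token" of the decoder input.
        feats = self.encoder(images).unsqueeze(1)              # (B, 1, E)
        inputs = torch.cat([feats, self.embed(captions)], 1)   # (B, 1+T, E)
        hidden, _ = self.lstm(inputs)
        return self.head(hidden)                               # (B, 1+T, V) logits


# Toy usage: random image batch and teacher-forced captions of length 10.
model = CaptionModel(vocab_size=1000)
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 1000, (4, 10))
print(model(images, captions).shape)  # torch.Size([4, 11, 1000])
```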

Libraries

Use these libraries to find Image Captioning models and implementations
See all 8 libraries.

Latest papers with no code

On Speculative Decoding for Multimodal Large Language Models

no code yet • 13 Apr 2024

We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B, removing the need for image tokens and their associated processing components in the draft model.
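
The speculative decoding loop itself is easy to state: a cheap draft model proposes a block of tokens and the full target model verifies them, keeping only the agreeing prefix. Below is a minimal, model-agnostic sketch of the greedy-acceptance variant; the draft and target are toy callables standing in for real language models, not the paper's LLaVA setup.

```python
# Greedy speculative decoding: draft proposes k tokens, target verifies them.
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_decode(draft: NextToken, target: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model verifies the proposal position by position.
        accepted = 0
        for i, t in enumerate(proposal):
            if target(out + proposal[:i]) == t:
                accepted += 1
            else:
                break
        out.extend(proposal[:accepted])
        # 3) On a mismatch (or full acceptance), take one token from the target.
        out.append(target(out))
    return out[:len(prompt) + max_new]


# Toy usage: the "target" repeats a cyclic pattern, the "draft" guesses it imperfectly.
target = lambda ctx: (ctx[-1] + 1) % 5
draft = lambda ctx: (ctx[-1] + 1) % 5 if len(ctx) % 3 else 0
print(speculative_decode(draft, target, prompt=[0], max_new=8))
```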

View Selection for 3D Captioning via Diffusion Ranking

no code yet • 11 Apr 2024

Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications.

Panoptic Perception: A Novel Task and Fine-grained Dataset for Universal Remote Sensing Image Interpretation

no code yet • 6 Apr 2024

Experimental results on FineGrip demonstrate the feasibility of the panoptic perception task and the beneficial effect of multi-task joint optimization on individual tasks.

Would Deep Generative Models Amplify Bias in Future Models?

no code yet • 4 Apr 2024

We investigate the impact of deep generative models on potential social biases in upcoming computer vision models.

Harnessing the Power of Large Vision Language Models for Synthetic Image Detection

no code yet • 3 Apr 2024

This study contributes to the advancement of synthetic image detection by exploiting the capabilities of visual language models such as BLIP-2 and ViTGPT2.

Bi-LORA: A Vision-Language Approach for Synthetic Image Detection

no code yet • 2 Apr 2024

Advancements in deep image synthesis techniques, such as generative adversarial networks (GANs) and diffusion models (DMs), have ushered in an era of generating highly realistic images.

VLRM: Vision-Language Models act as Reward Models for Image Captioning

no code yet • 2 Apr 2024

In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning and vision-language models like CLIP and BLIP2-ITM as reward models.
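
The reward side of such a setup can be sketched with an off-the-shelf CLIP checkpoint: score each sampled caption by its image-text similarity and feed that scalar to the RL update. The checkpoint name below is the public CLIP release, chosen purely for illustration; the RL fine-tuning loop and the paper's exact reward models are not reproduced here.

```python
# Minimal sketch of a CLIP-based caption reward (RL update omitted).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_reward(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Cosine similarity in [-1, 1]; higher means the caption matches the image better.
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()


# Toy usage with a blank image; in practice this scores captions sampled from
# the captioning policy during RL fine-tuning.
print(clip_reward(Image.new("RGB", (224, 224)), "a photo of a dog"))
```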

LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

no code yet • 1 Apr 2024

LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions, thus effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on low-quality instruction-following datasets.

Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

no code yet • 1 Apr 2024

Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering.

LocCa: Visual Pretraining with Location-aware Captioners

no code yet • 28 Mar 2024

In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa).