Image Captioning

613 papers with code • 32 benchmarks • 64 datasets

Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
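The encoder-decoder framing above maps directly onto a small amount of code. Below is a minimal sketch in PyTorch, assuming a toy CNN encoder and an LSTM decoder trained with teacher forcing; all module names and sizes are illustrative rather than taken from any particular paper.

```python
# Minimal encoder-decoder captioning sketch (PyTorch; toy sizes, illustrative only).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encodes an image into a fixed-size feature vector (stand-in CNN)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, images):                  # (B, 3, H, W)
        feats = self.conv(images).flatten(1)    # (B, 64)
        return self.fc(feats)                   # (B, embed_dim)

class Decoder(nn.Module):
    """Autoregressively decodes the image embedding into a token sequence."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_emb, captions):     # teacher forcing at train time
        tokens = self.embed(captions)           # (B, T, embed_dim)
        # Prepend the image embedding as the first "word" of the sequence.
        inputs = torch.cat([image_emb.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                 # (B, T+1, vocab_size)

encoder, decoder = Encoder(), Decoder(vocab_size=10000)
images = torch.randn(4, 3, 224, 224)           # dummy image batch
captions = torch.randint(0, 10000, (4, 12))    # dummy token ids
logits = decoder(encoder(images), captions)
print(logits.shape)                            # torch.Size([4, 13, 10000])
```

In practice the encoder is usually a pretrained backbone (e.g. a ResNet or ViT) and the decoder is often a Transformer that attends over spatial features rather than a single pooled vector.

For evaluation, BLEU can be computed with off-the-shelf tooling such as NLTK, while CIDEr is typically computed with the pycocoevalcap toolkit. A small BLEU example on dummy captions:

```python
# Scoring a generated caption against references with BLEU (NLTK).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across the grass".split(),
    "a brown dog is running on a field".split(),
]
candidate = "a dog is running on the grass".split()

# Smoothing avoids zero scores when some n-gram orders have no matches.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```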


LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

wangyuchi369/ladic 16 Apr 2024

Diffusion models have exhibited remarkable capabilities in text-to-image generation.


Bridging Vision and Language Spaces with Assignment Prediction

park-jungin/vlap 15 Apr 2024

This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world.


ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis

aashish2000/anchor 15 Apr 2024

With Large Language Models (LLMs) achieving success in language and commonsense reasoning tasks, we explore the ability of different LLMs to identify and understand key subjects from abstractive captions.


Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts

faceonlive/ai-research 12 Apr 2024

This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
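Pipelines like this are straightforward to prototype. The sketch below, with hypothetical helper functions standing in for the captioning model and the LLM, shows one basic caption-as-prompt pattern of the kind the study examines:

```python
# Hypothetical caption-as-prompt VQA pipeline. Both helpers are stand-ins:
# in a real system, generate_caption would be a captioning model and
# ask_llm a text-only large language model.
def generate_caption(image) -> str:
    return "a brown dog running across a grassy field"  # placeholder output

def ask_llm(prompt: str) -> str:
    return "a dog"  # placeholder output

def vqa_with_caption(image, question: str) -> str:
    # The caption turns the image into text the LLM can condition on.
    caption = generate_caption(image)
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    return ask_llm(prompt)

print(vqa_with_caption(image=None, question="What animal is in the picture?"))
```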


CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Karine-Huang/T2I-CompBench 4 Apr 2024

We attribute this misalignment between text prompts and generated images to the diffusion model's insufficient condition utilization, which is caused by its training paradigm.


Disentangled Pre-training for Human-Object Interaction Detection

xingaoli/dp-hoi 2 Apr 2024

We propose an efficient disentangled pre-training method for HOI detection (DP-HOI) to address this problem.


Semantic Map-based Generation of Navigation Instructions

chengzu-li/vlgen 28 Mar 2024

In this paper, we propose a new approach to navigation instruction generation by framing the problem as an image captioning task using semantic maps as visual input.


Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction

inhwanbae/lmtrajectory 27 Mar 2024

To guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task question answering.


VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning

ys-zong/vl-icl 19 Mar 2024

Built on top of LLMs, vision large language models (VLLMs) have advanced significantly in areas such as recognition, reasoning, and grounding.


Does the Performance of Text-to-Image Retrieval Models Generalize Beyond Captions-as-a-Query?

AU-DIS/ConQA European Conference on Information Retrieval 2024

ConQA comprises 30 descriptive and 50 conceptual queries on 43k images with more than 100 manually annotated images per query.
