Image Captioning
613 papers with code • 32 benchmarks • 64 datasets
Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, in which an input image is encoded into an intermediate representation of its content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.
(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
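As a minimal sketch of the encoder-decoder framework described above, the PyTorch model below encodes an image into a single feature vector with a CNN backbone and decodes it into a token sequence with an LSTM. The ResNet-18 backbone, vocabulary size, and layer widths are illustrative assumptions, not the recipe of any particular paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        # Encoder: a CNN backbone whose pooled features summarize the image.
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier head
        self.project = nn.Linear(512, embed_dim)
        # Decoder: an LSTM that emits the caption one token at a time.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Encode the image into an intermediate representation.
        feats = self.encoder(images).flatten(1)      # (B, 512)
        feats = self.project(feats).unsqueeze(1)     # (B, 1, E)
        # Condition the decoder by prepending the image feature to the
        # embedded ground-truth tokens (teacher forcing during training).
        tokens = self.embed(captions)                # (B, T, E)
        inputs = torch.cat([feats, tokens], dim=1)   # (B, 1+T, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                      # per-step vocabulary logits

model = CaptioningModel(vocab_size=10_000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10_000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```

For evaluation, BLEU can be computed against reference captions with NLTK, as below; CIDEr requires a full reference set and a toolkit such as pycocoevalcap. The example sentences are made up for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "on", "the", "beach"]]
candidate = ["a", "dog", "is", "running", "on", "the", "beach"]
print(sentence_bleu(references, candidate,
                    smoothing_function=SmoothingFunction().method1))
```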
Libraries
Use these libraries to find Image Captioning models and implementations
Datasets
Subtasks
Latest papers
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Diffusion models have exhibited remarkable capabilities in text-to-image generation.
Bridging Vision and Language Spaces with Assignment Prediction
This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world.
ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis
With Large Language Models (LLMs) achieving success in language and commonsense reasoning tasks, we explore the ability of different LLMs to identify and understand key subjects from abstractive captions.
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
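As a hedged sketch of this caption-as-prompt idea: generate a caption first, then pass it to a text-only model together with the question. Both helper functions below are hypothetical placeholders standing in for a captioner and an LLM, not the authors' pipeline.

```python
def generate_caption(image) -> str:
    """Placeholder captioner: any captioning model fits here."""
    return "a man holding a red umbrella in the rain"

def answer_question(prompt: str) -> str:
    """Placeholder for a text-only LLM call."""
    return "red"

def caption_driven_vqa(image, question: str) -> str:
    # Step 1: describe the image in words.
    caption = generate_caption(image)
    # Step 2: ask a text-only model, grounding the question in the caption.
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    return answer_question(prompt)

print(caption_driven_vqa(image=None, question="What color is the umbrella?"))
```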
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
We attribute the misalignment between text prompts and generated images to the diffusion model's insufficient use of its conditioning signal, which is caused by its training paradigm.
Disentangled Pre-training for Human-Object Interaction Detection
We propose DP-HOI, an efficient disentangled pre-training method for human-object interaction (HOI) detection.
Semantic Map-based Generation of Navigation Instructions
In this paper, we propose a new approach to navigation instruction generation by framing the problem as an image captioning task using semantic maps as visual input.
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
To guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce an auxiliary multi-task question-answering objective.
VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning
Built on top of LLMs, vision large language models (VLLMs) have advanced significantly in areas such as recognition, reasoning, and grounding.
Does the Performance of Text-to-Image Retrieval Models Generalize Beyond Captions-as-a-Query?
ConQA comprises 30 descriptive and 50 conceptual queries on 43k images with more than 100 manually annotated images per query.