Image Captioning
622 papers with code • 32 benchmarks • 66 datasets
Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, in which an input image is encoded into an intermediate representation of its content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU and CIDEr.
(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
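To make the encoder-decoder framework described above concrete, here is a minimal sketch, not any particular published model: a CNN encoder maps the image to a feature vector, and an LSTM decoder generates the caption token by token. The ResNet-50 backbone, vocabulary size, and layer dimensions are illustrative assumptions.

```python
# Minimal encoder-decoder captioning sketch (illustrative, not a specific paper's model).
import torch
import torch.nn as nn
import torchvision.models as models


class ImageEncoder(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        # weights=None keeps the sketch offline; in practice a pretrained backbone is used.
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):                       # (B, 3, H, W)
        feats = self.cnn(images).flatten(1)          # (B, 2048)
        return self.fc(feats)                        # (B, embed_dim)


class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # Prepend the image feature as the first "token" of the input sequence.
        tokens = self.embed(captions)                              # (B, T, embed_dim)
        inputs = torch.cat([image_feats.unsqueeze(1), tokens], 1)  # (B, T+1, embed_dim)
        hidden, _ = self.lstm(inputs)                              # (B, T+1, hidden_dim)
        return self.out(hidden)                                    # per-step word logits


# Illustrative forward pass with random data (placeholder sizes).
encoder = ImageEncoder(embed_dim=256)
decoder = CaptionDecoder(vocab_size=10000, embed_dim=256, hidden_dim=512)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 15))
logits = decoder(encoder(images), captions)   # shape (2, 16, 10000)
```

At inference time the decoder is typically run autoregressively (greedy or beam search), and the generated captions are scored against reference captions with BLEU or CIDEr.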
Libraries
Use these libraries to find Image Captioning models and implementations.
Datasets
Subtasks
Latest papers
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
MeaCap: Memory-Augmented Zero-shot Image Captioning
The MeaCap framework achieves state-of-the-art performance on a series of zero-shot IC settings.
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT
Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query.
What Is Missing in Multilingual Visual Reasoning and How to Fix It
NLP models today strive to support multiple languages and modalities, improving accessibility for diverse users.
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
We hypothesize that this is because explicit spatial relations rarely appear in the image captions used to train these models.
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models.
Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning
Secondly, they can serve as additional trajectories in the RL strategy, resulting in a teacher forcing loss weighted by the similarity of the GT to the image.
Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images
Following on recent advances in large language models (LLMs) and subsequent chat models, a new wave of large vision-language models (LVLMs) has emerged.
GPTs Are Multilingual Annotators for Sequence Generation Tasks
However, the conventional approach of data annotation through crowdsourcing is both time-consuming and expensive.
Text-Guided Image Clustering
We therefore propose Text-Guided Image Clustering, i.e., generating text using image captioning and visual question-answering (VQA) models and subsequently clustering the generated text.
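As a rough sketch of the pipeline this excerpt describes (caption the images, then cluster the generated text rather than the pixels), the snippet below captions images with an off-the-shelf model, embeds the captions, and clusters the embeddings. The specific models (BLIP, a MiniLM sentence encoder), the file paths, and the cluster count are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative text-guided image clustering: caption -> embed -> cluster.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Off-the-shelf captioning model (an assumption; the paper may use other models).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(path: str) -> str:
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    ids = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(ids[0], skip_special_tokens=True)

image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]   # placeholder paths
captions = [caption(p) for p in image_paths]

# Embed the generated captions and cluster in text space instead of pixel space.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(captions)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(list(zip(image_paths, labels)))
```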