Image Captioning

622 papers with code • 32 benchmarks • 66 datasets

Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU or CIDEr.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
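As a concrete illustration of the encoder-decoder framework described above, here is a minimal, hypothetical PyTorch sketch (dimensions and names are illustrative, not taken from any of the papers below): pooled image-encoder features initialize an LSTM decoder that emits the caption token by token.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Minimal encoder-decoder captioner: image features -> LSTM token decoder."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)       # image feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)  # caption token embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)     # hidden state -> vocabulary logits

    def forward(self, image_feats, captions):
        # image_feats: (B, feat_dim) pooled encoder output; captions: (B, T) token ids
        h0 = torch.tanh(self.proj(image_feats)).unsqueeze(0)  # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.head(out)  # (B, T, vocab_size): next-token logits for teacher forcing

model = CaptionModel(vocab_size=10_000)
feats = torch.randn(4, 2048)              # stand-in for a CNN/ViT image encoder
caps = torch.randint(0, 10_000, (4, 12))  # ground-truth captions (teacher forcing)
logits = model(feats, caps)
```

At inference time the decoder is instead run autoregressively (greedy or beam search), and the generated captions are scored against reference captions with metrics such as BLEU or CIDEr.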

Libraries

Use these libraries to find Image Captioning models and implementations
See all 8 libraries.

Beyond Text: Frozen Large Language Models in Visual Signal Comprehension

zh460045050/v2l-tokenizer 12 Mar 2024

To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
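A rough sketch of the core idea as the snippet describes it, assuming image patch embeddings are matched to LLM vocabulary entries by embedding similarity; the function and variable names are hypothetical, and the real V2T Tokenizer is more involved:

```python
import torch
import torch.nn.functional as F

def tokenize_image(patch_embeds, vocab_embeds, vocab_tokens):
    """Map each image patch to the LLM vocabulary token whose text embedding
    (e.g., from CLIP) is most similar, turning the image into a token sequence."""
    patches = F.normalize(patch_embeds, dim=-1)   # (P, D) image patch embeddings
    vocab = F.normalize(vocab_embeds, dim=-1)     # (V, D) embeddings of LLM vocabulary entries
    nearest = (patches @ vocab.T).argmax(dim=-1)  # cosine-nearest token per patch
    return [vocab_tokens[i] for i in nearest.tolist()]
```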

MeaCap: Memory-Augmented Zero-shot Image Captioning

joeyz0z/meacap 6 Mar 2024

The MeaCap framework achieves state-of-the-art performance across a range of zero-shot image captioning (IC) settings.

VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT

YoucanBaby/VTG-GPT Applied Sciences 2024

Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query.

What Is Missing in Multilingual Visual Reasoning and How to Fix It

yueqis/multilingual_visual_reasoning 3 Mar 2024

NLP models today strive to support multiple languages and modalities, improving accessibility for diverse users.

Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset

salanueva/sr4g 1 Mar 2024

We hypothesize that this is because explicit spatial relations rarely appear in the image captions used to train these models.

Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

keio-smilab24/Polos 28 Feb 2024

Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models.

Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning

nohtow/wtf-rl 21 Feb 2024

Secondly, ground-truth (GT) captions can serve as additional trajectories in the RL strategy, resulting in a teacher-forcing loss weighted by the similarity of the GT caption to the image.
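A minimal sketch of that weighting idea, assuming a per-sample CLIP similarity score has already been computed; this illustrates similarity-weighted teacher forcing in general, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def weighted_teacher_forcing_loss(logits, gt_tokens, clip_sim):
    """Token-level cross-entropy on the GT caption, scaled per sample by the
    CLIP similarity of that caption to the image (clip_sim, shape (B,))."""
    # logits: (B, T, V); gt_tokens: (B, T)
    ce = F.cross_entropy(logits.transpose(1, 2), gt_tokens, reduction="none")  # (B, T)
    return (clip_sim.unsqueeze(1) * ce).mean()
```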

Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images

katiefraser/pairs 8 Feb 2024

Following on recent advances in large language models (LLMs) and subsequent chat models, a new wave of large vision-language models (LVLMs) has emerged.

GPTs Are Multilingual Annotators for Sequence Generation Tasks

c-juhwan/gpt-multilingual-annotator 8 Feb 2024

However, the conventional approach of data annotation through crowdsourcing is both time-consuming and expensive.

Text-Guided Image Clustering

andst/text_guided_cl 5 Feb 2024

We therefore propose Text-Guided Image Clustering, i.e., generating text using image captioning and visual question answering (VQA) models, and subsequently clustering the generated text.
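The pipeline the snippet describes can be sketched in a few lines, assuming off-the-shelf components; the captioning model is abstracted away, and the embedding model and cluster count below are illustrative choices, not the paper's:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_by_generated_text(captions, n_clusters=10):
    """Cluster images via the text generated for them: embed each caption
    (or VQA answer), then group the embeddings with k-means."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = embedder.encode(captions)  # (N, D) sentence embeddings
    return KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)

# captions = [captioner(img) for img in images]  # any image captioning model
# labels = cluster_by_generated_text(captions)
```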
