Image Captioning

619 papers with code • 33 benchmarks • 65 datasets

Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated using BLEU or CIDEr metrics.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
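
To make the encoder-decoder framing concrete, here is a minimal PyTorch sketch. It is illustrative only: the toy CNN stands in for the pretrained ResNet/ViT backbone a real system would use, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    """Minimal encoder-decoder captioner: a toy CNN encodes the image into a
    grid of feature vectors, and a Transformer decoder cross-attends to them
    while predicting caption tokens (teacher forcing at train time)."""

    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        # Toy encoder; real systems use a pretrained ResNet/ViT backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        feats = self.encoder(images)              # (B, D, H', W')
        feats = feats.flatten(2).transpose(1, 2)  # (B, H'*W', D) image "memory"
        tgt = self.embed(tokens)                  # (B, T, D)
        # Causal mask so each position only attends to earlier caption tokens.
        T = tokens.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, feats, tgt_mask=mask)
        return self.lm_head(out)                  # (B, T, vocab) next-token logits

model = CaptioningModel(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 10000, (2, 12))         # caption token ids
print(model(images, tokens).shape)                # torch.Size([2, 12, 10000])
```

At evaluation time, generated captions are scored against reference captions, e.g. with `nltk.translate.bleu_score.corpus_bleu` for BLEU or the `pycocoevalcap` toolkit for CIDEr.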

Libraries

Use these libraries to find Image Captioning models and implementations

Latest papers with no code

VLRM: Vision-Language Models act as Reward Models for Image Captioning

no code yet • 2 Apr 2024

In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning and vision-language models like CLIP and BLIP2-ITM as reward models.
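
As a rough sketch of this recipe (not VLRM's exact procedure): sample a caption from the captioner, score its agreement with the image using a frozen CLIP model, and use that score as a policy-gradient reward against an SCST-style greedy baseline. The CLIP calls below assume an open_clip-style `encode_image`/`encode_text` interface; the surrounding names are hypothetical.

```python
import torch
import torch.nn.functional as F

def clip_reward_loss(clip_model, images, sampled_ids, sampled_logprobs, greedy_ids):
    """Policy-gradient loss with CLIP image-text similarity as the reward and
    a greedy-decoded caption as an SCST-style baseline.

    Assumes an open_clip-style interface: `encode_image` takes images,
    `encode_text` takes tokenized caption ids. `sampled_logprobs` holds the
    captioner's per-token log-probs for the sampled caption, shape (B, T).
    """
    with torch.no_grad():  # the reward model is frozen
        img = F.normalize(clip_model.encode_image(images), dim=-1)
        samp = F.normalize(clip_model.encode_text(sampled_ids), dim=-1)
        greedy = F.normalize(clip_model.encode_text(greedy_ids), dim=-1)
        reward = (img * samp).sum(-1)      # cosine(image, sampled caption)
        baseline = (img * greedy).sum(-1)  # cosine(image, greedy caption)
        advantage = reward - baseline      # (B,)
    # REINFORCE: raise log-probs of captions that beat the greedy baseline
    # under CLIP; gradients flow only through the captioner's log-probs.
    return -(advantage.unsqueeze(1) * sampled_logprobs).mean()
```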

LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

no code yet • 1 Apr 2024

LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions, thereby preserving the LLM's pre-trained knowledge when fine-tuning on low-quality instruction-following datasets.

Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

no code yet • 1 Apr 2024

Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering.

LocCa: Visual Pretraining with Location-aware Captioners

no code yet • 28 Mar 2024

In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa).

Text Data-Centric Image Captioning with Interactive Prompts

no code yet • 28 Mar 2024

Among existing approaches, the mainstream solution is to project image embeddings into the text embedding space, leveraging the aligned image-text representations learned by the CLIP model.
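
A common instantiation of this projection idea (in the spirit of ClipCap, not necessarily this paper's architecture) maps the CLIP image embedding into a short "prefix" of vectors in the language model's embedding space, which is prepended to the decoder input. Dimensions and names below are illustrative.

```python
import torch
import torch.nn as nn

class ClipPrefixProjector(nn.Module):
    """Projects a CLIP image embedding into `prefix_len` vectors in the text
    decoder's embedding space; the prefix is prepended to the decoder input
    so a (possibly frozen) language model can condition on the image."""

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        hidden = (clip_dim + lm_dim * prefix_len) // 2
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, lm_dim * prefix_len),
        )

    def forward(self, clip_embedding):                        # (B, clip_dim)
        prefix = self.proj(clip_embedding)                    # (B, lm_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.lm_dim)  # (B, prefix_len, lm_dim)

proj = ClipPrefixProjector()
fake_clip = torch.randn(4, 512)  # stand-in for CLIP image embeddings
print(proj(fake_clip).shape)     # torch.Size([4, 10, 768])
```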

A Review of Multi-Modal Large Language and Vision Models

no code yet • 28 Mar 2024

Large Language Models (LLMs) have recently emerged as a focal point of research and application, driven by their unprecedented ability to understand and generate text with human-like quality.

A Survey on Large Language Models from Concept to Implementation

no code yet • 27 Mar 2024

Recent advancements in Large Language Models (LLMs), particularly those built on Transformer architectures, have significantly broadened the scope of natural language processing (NLP) applications, transcending their initial use in chatbot technology.

The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge

no code yet • 26 Mar 2024

In this paper, we propose a solution for improving the quality of captions generated for figures in papers.

Visual Hallucination: Definition, Quantification, and Prescriptive Remediations

no code yet • 26 Mar 2024

The troubling rise of hallucination presents perhaps the most significant impediment to the advancement of responsible AI.

Semi-Supervised Image Captioning Considering Wasserstein Graph Matching

no code yet • 26 Mar 2024

Image captioning automatically generates captions for given images, and the key challenge is to learn a mapping function from visual features to natural language features.
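
As a toy illustration of such a matching objective (far simpler than the paper's Wasserstein graph matching): with uniform weights over two equal-sized feature sets, the 1-Wasserstein distance reduces to an optimal one-to-one assignment, which SciPy solves exactly.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def matching_cost(visual_feats, word_feats):
    """Optimal one-to-one matching cost between image-region features and
    caption-word features. With uniform weights over equal-sized sets, this
    coincides with the 1-Wasserstein (optimal transport) distance."""
    cost = cdist(visual_feats, word_feats, metric="cosine")  # pairwise costs
    rows, cols = linear_sum_assignment(cost)                 # exact assignment
    return cost[rows, cols].mean()

regions = np.random.randn(8, 256)  # e.g. 8 detected region features
words = np.random.randn(8, 256)    # e.g. 8 word embeddings from a caption
print(matching_cost(regions, words))
```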