Image Captioning

524 papers with code • 30 benchmarks • 61 datasets

Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: the input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.
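
To make the evaluation step concrete, here is a minimal pure-Python sketch of a sentence-level BLEU-style score: clipped n-gram precision combined with a brevity penalty. The function names (`ngrams`, `modified_precision`, `sentence_bleu`) are illustrative and not taken from any benchmark toolkit. The sketch uses whitespace tokens and no smoothing, so its scores will not match official COCO evaluation numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Geometric mean of the 1..max_n clipped precisions times a brevity penalty.
    Unsmoothed: returns 0.0 if any precision is zero."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Use the reference length closest to the candidate length (ties: shorter).
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    if len(candidate) > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / len(candidate))
    return brevity_penalty * math.exp(log_mean)


references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
degenerate = "the the the the the the the".split()
# Clipping credits "the" only twice, so unigram precision is 2/7.
print(modified_precision(degenerate, references, 1))
```

Clipping is what stops a degenerate caption that repeats a common word from scoring well. CIDEr follows a similar n-gram matching idea but weights n-grams by TF-IDF, which this sketch does not attempt.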

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)



Most implemented papers

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning • 10 Feb 2015

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images.

Show and Tell: A Neural Image Caption Generator

karpathy/neuraltalk • CVPR 2015

Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions.

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

peteanderson80/bottom-up-attention • CVPR 2018

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.

Self-critical Sequence Training for Image Captioning

ruotianluo/neuraltalk2.pytorch • CVPR 2017

In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.
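
As a rough illustration of the reinforcement-learning idea, the sketch below computes a REINFORCE-style loss for one caption, using the reward of the greedily decoded caption as the baseline. The function name `scst_loss` and its arguments are hypothetical and are not taken from the linked repository. The rewards are assumed to come from a test metric such as CIDEr, computed elsewhere.

```python
def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Policy-gradient loss for a single sampled caption.

    sample_logprobs: log-probabilities of the sampled caption's tokens
    sample_reward:   metric score of the sampled caption (assumed precomputed)
    greedy_reward:   metric score of the greedy caption, used as the baseline
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of samples that beat the greedy
    # caption and lowers the likelihood of samples that score below it.
    return -advantage * sum(sample_logprobs)


# A sampled caption that outscores the greedy one yields a positive loss term
# whose gradient pushes its token log-probabilities up.
print(scst_loss([-0.5, -1.0], sample_reward=1.2, greedy_reward=0.8))
```

In a real training loop the log-probabilities would be differentiable tensors and the loss would be averaged over a batch. Plain floats are used here only to show the arithmetic.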

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

clovaai/CutMix-PyTorch • ICCV 2019

Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers.

CIDEr: Consensus-based Image Description Evaluation

tylin/coco-caption • CVPR 2015

We propose a novel paradigm for evaluating image descriptions that uses human consensus.

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

ashwinkalyan/dbs • 7 Oct 2016

We observe that our method consistently outperforms BS and previously proposed techniques for diverse decoding from neural sequence models.

VQA: Visual Question Answering

ramprs/grad-cam • ICCV 2015

Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Recurrent Neural Network Regularization

wojzaremba/lstm • 8 Sep 2014

We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.

Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge

tensorflow/models • 21 Sep 2016

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.