524 papers with code • 30 benchmarks • 61 datasets
Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of its content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU or CIDEr.
(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
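To make the encoder-decoder framework concrete, below is a minimal PyTorch sketch: a CNN encodes the image into a fixed-size vector, and an LSTM decodes that vector into a token sequence. The architecture, module names, and hyperparameters are illustrative assumptions, not taken from any specific paper on this page.

```python
# Minimal sketch of the encoder-decoder captioning pattern described above.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionEncoder(nn.Module):
    """Encode an image into a fixed-size feature vector with a CNN backbone."""
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)  # any CNN backbone works here
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.proj = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):                    # images: (B, 3, H, W)
        feats = self.cnn(images).flatten(1)       # (B, 512)
        return self.proj(feats)                   # (B, embed_dim)

class CaptionDecoder(nn.Module):
    """Decode the image representation into a word sequence with an LSTM."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):     # captions: (B, T) token ids
        tokens = self.embed(captions)             # (B, T, embed_dim)
        # Prepend the image feature as the first "word" the LSTM sees.
        inputs = torch.cat([image_feats.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                   # (B, T+1, vocab_size) logits
```

Training typically minimizes cross-entropy between the predicted logits and the ground-truth next tokens; at test time the decoder is run greedily or with beam search.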
Inspired by recent work in machine translation and object detection, we introduce an attention-based model that automatically learns to describe the content of images.
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
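Both excerpts above refer to visual attention, where the decoder re-weights image regions at each step instead of consuming a single global vector. Below is a minimal sketch of additive (soft) attention over regional features, under assumed shapes and names; it is a generic illustration, not the exact mechanism of any one paper.

```python
# Minimal sketch of soft (top-down) attention over regional image features.
# Shapes and module names are illustrative assumptions.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Score each image region against the decoder state, then pool."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, state):
        # feats: (B, R, feat_dim) regional features; state: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.state_proj(state).unsqueeze(1)
        )).squeeze(-1)                            # (B, R) unnormalized scores
        alpha = torch.softmax(e, dim=1)           # attention weights over regions
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)
        return context, alpha
```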
In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.
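A common way to realize this is to treat the evaluation metric (e.g. CIDEr) as a reward and update the captioner with a policy-gradient (REINFORCE) step, often using the model's own greedy decode as a variance-reducing baseline. The sketch below assumes hypothetical `model.sample`, `model.greedy`, and `cider_reward` interfaces rather than a real library API.

```python
# Hedged sketch of metric-as-reward training with REINFORCE and a
# greedy-decode baseline. `model.sample`, `model.greedy`, and
# `cider_reward` are hypothetical interfaces, not a real library API.
import torch

def policy_gradient_loss(model, images, references, cider_reward):
    # Sample a caption and keep per-token log-probabilities.
    sampled_ids, log_probs = model.sample(images)      # stochastic decode
    with torch.no_grad():
        greedy_ids = model.greedy(images)              # baseline decode
    # Reward = metric of the sample minus metric of the greedy baseline.
    r_sample = cider_reward(sampled_ids, references)   # (B,)
    r_greedy = cider_reward(greedy_ids, references)    # (B,)
    advantage = (r_sample - r_greedy).unsqueeze(1)     # (B, 1)
    # REINFORCE: raise log-probability of samples that beat the baseline.
    return -(advantage * log_probs).sum(dim=1).mean()
```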
Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers.
We observe that our method consistently outperforms beam search (BS) and previously proposed techniques for diverse decoding from neural sequence models.
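One common recipe for diverse decoding splits the beams into groups and penalizes a group for reusing tokens that earlier groups already chose at the same time step. The sketch below illustrates that single idea under a hypothetical `log_prob_next` model interface (prefix in, next-token log-probabilities out); it is a simplification, not the exact algorithm evaluated above.

```python
# Simplified sketch of group-penalized (diverse) beam search.
# `log_prob_next` is a hypothetical model interface mapping a token
# prefix to a dict of {next_token: log_probability}.
from collections import Counter

def diverse_beam_step(prefixes, log_prob_next, num_groups=2,
                      group_width=2, diversity_strength=0.5):
    """One decoding step; `prefixes[g]` is a list of (tokens, score) beams."""
    chosen_now = Counter()          # tokens picked by earlier groups this step
    new_prefixes = []
    for g in range(num_groups):
        candidates = []
        for tokens, score in prefixes[g]:
            for tok, lp in log_prob_next(tokens).items():
                penalty = diversity_strength * chosen_now[tok]
                candidates.append((tokens + [tok], score + lp - penalty))
        candidates.sort(key=lambda c: c[1], reverse=True)
        kept = candidates[:group_width]
        new_prefixes.append(kept)
        for tokens, _ in kept:
            chosen_now[tokens[-1]] += 1  # later groups pay for reusing this token
    return new_prefixes
```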
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.