GIT: A Generative Image-to-text Transformer for Vision and Language

27 May 2022  ·  Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoders and decoders) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost performance. Without bells and whistles, our GIT establishes a new state of the art on 12 challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Code is released at \url{https://github.com/microsoft/GenerativeImage2Text}.
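The simplified design described in the abstract (one image encoder producing visual tokens, one text decoder trained with a plain language-modeling loss, where text tokens attend to all image tokens but only to earlier text tokens) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: GIT uses a contrastively pre-trained ViT image encoder and a much larger decoder, while the `TinyGIT` name, the linear patch encoder, and all module sizes here are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class TinyGIT(nn.Module):
    """Minimal sketch of GIT's layout: image encoder + text decoder,
    trained with a single language-modeling objective. All sizes are
    illustrative, not the paper's."""
    def __init__(self, vocab=1000, d=64, n_img_tokens=16, max_len=32):
        super().__init__()
        # stand-in image encoder: a linear projection of flat patch
        # features (the paper uses a pre-trained ViT here)
        self.img_enc = nn.Linear(128, d)
        self.tok = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(n_img_tokens + max_len, d)
        layer = nn.TransformerEncoderLayer(
            d_model=d, nhead=4, dim_feedforward=128, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)
        self.n_img = n_img_tokens

    def forward(self, patches, tokens):
        # patches: (B, n_img_tokens, 128); tokens: (B, T) caption ids
        B, T = tokens.shape
        x = torch.cat([self.img_enc(patches), self.tok(tokens)], dim=1)
        x = x + self.pos(torch.arange(x.size(1), device=x.device))
        # GIT-style seq2seq mask (True = blocked): image tokens attend
        # freely to each other; text tokens see all image tokens and
        # only preceding text tokens; image tokens never see text.
        L = self.n_img + T
        mask = torch.zeros(L, L, dtype=torch.bool, device=x.device)
        mask[self.n_img:, self.n_img:] = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        mask[:self.n_img, self.n_img:] = True
        h = self.decoder(x, mask=mask)
        # logits only for the text positions
        return self.head(h[:, self.n_img:])

# usage: a single cross-entropy (language-modeling) loss over captions
model = TinyGIT()
patches = torch.randn(2, 16, 128)
tokens = torch.randint(0, 1000, (2, 8))
logits = model(patches, tokens)  # shape (2, 8, 1000)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 1000), tokens[:, 1:].reshape(-1))
```

Because the same next-token objective is used everywhere, the generation-based classification and scene-text recognition mentioned above fall out of the same model: the class name or the scene text is simply decoded as a caption.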

Results (Global Rank shown in parentheses)

Image Captioning on COCO Captions

| Model | BLEU-4 | METEOR | CIDEr | SPICE |
|---|---|---|---|---|
| GIT | 44.1 (#3) | 32.2 (#4) | 151.1 (#4) | 26.3 (#3) |

Video Captioning on MSR-VTT

| Model | CIDEr | METEOR | ROUGE-L | BLEU-4 |
|---|---|---|---|---|
| GIT2 | 75.9 (#3) | 33.1 (#4) | 68.2 (#3) | 54.8 (#3) |

Visual Question Answering (VQA) on MSVD-QA

| Model | Accuracy |
|---|---|
| GIT | 0.568 (#9) |

Image Captioning on nocaps (single model)

| Split | Model | CIDEr | B1 | B2 | B3 | B4 | ROUGE-L | METEOR | SPICE |
|---|---|---|---|---|---|---|---|---|---|
| entire | GIT | 123.39 (#2) | 88.1 (#1) | 74.81 (#1) | 57.68 (#1) | 37.35 (#2) | 63.12 (#1) | 32.5 (#1) | 15.94 (#1) |
| in-domain | GIT | 122.4 (#3) | 88.55 (#2) | 76.1 (#1) | 60.53 (#1) | 41.65 (#1) | 64.02 (#2) | 33.41 (#3) | 16.18 (#2) |
| in-domain | GIT2 | 124.18 (#2) | 88.86 (#1) | 75.86 (#2) | 59.94 (#2) | 41.1 (#3) | 63.82 (#3) | 33.83 (#2) | 16.36 (#1) |
| near-domain | GIT2 | 125.51 (#1) | 88.9 (#1) | 75.86 (#1) | 58.9 (#2) | 38.95 (#2) | 63.66 (#2) | 32.95 (#2) | 16.11 (#1) |
| near-domain | GIT | 123.92 (#3) | 88.56 (#3) | 75.48 (#3) | 58.46 (#3) | 38.44 (#4) | 63.5 (#3) | 32.86 (#3) | 15.96 (#2) |
| out-of-domain | GIT2 | 122.27 (#2) | 86.28 (#1) | 71.15 (#3) | 52.36 (#3) | 30.15 (#3) | 60.91 (#3) | 30.15 (#4) | 15.62 (#2) |
| out-of-domain | GIT | 122.04 (#3) | 85.99 (#3) | 71.28 (#1) | 52.66 (#1) | 30.04 (#4) | 60.96 (#2) | 30.45 (#2) | 15.7 (#1) |

Image Captioning on nocaps-XD

| Split | Model | CIDEr | B1 | B2 | B3 | B4 | ROUGE-L | METEOR | SPICE |
|---|---|---|---|---|---|---|---|---|---|
| entire | GIT2 | 124.77 (#1) | 88.43 (#1) | 75.02 (#1) | 57.87 (#1) | 37.65 (#1) | 63.19 (#1) | 32.56 (#1) | 16.06 (#1) |
| entire | GIT | 123.39 (#2) | 88.1 (#2) | 74.81 (#2) | 57.68 (#2) | 37.35 (#2) | 63.12 (#2) | 32.5 (#2) | 15.94 (#2) |
| in-domain | GIT2 | 124.18 (#1) | 88.86 (#1) | 75.86 (#2) | 59.94 (#2) | 41.1 (#2) | 63.82 (#2) | 33.83 (#1) | 16.36 (#1) |
| in-domain | GIT | 122.4 (#2) | 88.55 (#2) | 76.1 (#1) | 60.53 (#1) | 41.65 (#1) | 64.02 (#1) | 33.41 (#2) | 16.18 (#2) |
| near-domain | GIT2 | 125.51 (#1) | 88.9 (#1) | 75.86 (#1) | 58.9 (#1) | 38.95 (#1) | 63.66 (#1) | 32.95 (#1) | 16.11 (#1) |
| near-domain | GIT | 123.92 (#2) | 88.56 (#2) | 75.48 (#2) | 58.46 (#2) | 38.44 (#2) | 63.5 (#2) | 32.86 (#2) | 15.96 (#2) |
| out-of-domain | GIT2 | 122.27 (#1) | 86.28 (#1) | 71.15 (#2) | 52.36 (#2) | 30.15 (#1) | 60.91 (#2) | 30.15 (#2) | 15.62 (#2) |
| out-of-domain | GIT | 122.04 (#2) | 85.99 (#2) | 71.28 (#1) | 52.66 (#1) | 30.04 (#2) | 60.96 (#1) | 30.45 (#1) | 15.7 (#1) |