TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Captioning	COCO Captions	X-Transformer	BLEU-4	39.7	# 21
Image Captioning	COCO Captions	X-Transformer	METEOR	29.5	# 17
Image Captioning	COCO Captions	X-Transformer	ROUGE-L	59.1	# 7
Image Captioning	COCO Captions	X-Transformer	CIDER	132.8	# 23
Image Captioning	COCO Captions	X-Transformer	SPICE	23.4	# 19
Image Captioning	COCO Captions	X-Transformer	BLEU-1	80.9	# 5
Image Captioning	COCO Captions	X-Transformer	BLEU-2	65.8	# 1
Image Captioning	COCO Captions	X-Transformer	BLEU-3	51.5	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/x-linear-attention-networks-for-image/image-captioning-on-coco-captions)](https://paperswithcode.com/sota/image-captioning-on-coco-captions?p=x-linear-attention-networks-for-image)`

X-Linear Attention Networks for Image Captioning

CVPR 2020 · Yingwei Pan, Ting Yao, Yehao Li, Tao Mei

Recent progress on fine-grained visual recognition and visual question answering has featured bilinear pooling, which effectively models the $2^{nd}$ order interactions across multi-modal inputs. Nevertheless, there has been no evidence in support of building such interactions concurrently with an attention mechanism for image captioning. In this paper, we introduce a unified attention block, the X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, the X-Linear attention block simultaneously exploits both spatial and channel-wise bilinear attention distributions to capture the $2^{nd}$ order interactions between the input single-modal or multi-modal features. Higher-order and even infinite-order feature interactions are readily modeled by stacking multiple X-Linear attention blocks and by equipping the block with the Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed X-LAN), which integrate X-Linear attention block(s) into the image encoder and sentence decoder of an image captioning model to leverage higher-order intra- and inter-modal interactions. Experiments on the COCO benchmark demonstrate that X-LAN obtains the best published CIDEr score to date, 132.0%, on the COCO Karpathy test split. When Transformer is further endowed with X-Linear attention blocks, CIDEr is boosted to 132.8%. Source code is available at https://github.com/Panda-Peter/image-captioning.
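To make the description above more concrete, here is a minimal NumPy sketch of one attention block that combines low-rank bilinear pooling with a spatial (softmax) and a channel-wise (sigmoid) attention distribution. It is an illustration written from the abstract, not the authors' implementation: the function names, the weight names (`Wk`, `Wqk`, `Wv`, `Wqv`, `Wb`, `ws`, `We`), the layer sizes, and the exact way the two attention distributions are combined are all assumptions. For the real model, see the official repository.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def elu(x):
    # ELU: x for x > 0, exp(x) - 1 otherwise.
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d_q, d_k, d_v, d_b, d_c, rng):
    # Hypothetical weight shapes; small random values for demonstration only.
    shapes = {
        "Wk": (d_k, d_b), "Wqk": (d_q, d_b),   # key / query projections
        "Wv": (d_v, d_b), "Wqv": (d_q, d_b),   # value / query projections
        "Wb": (d_b, d_c),                      # embeds the pooled features
        "ws": (d_c,),                          # spatial attention scorer
        "We": (d_c, d_b),                      # channel attention scorer
    }
    return {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}

def bilinear_attention(q, K, V, p, act=relu):
    """q: (d_q,) query; K: (N, d_k) keys; V: (N, d_v) values."""
    # Low-rank bilinear pooling: elementwise product of projected query and keys.
    Bk = act(K @ p["Wk"]) * act(q @ p["Wqk"])        # (N, d_b)
    Bk_emb = relu(Bk @ p["Wb"])                      # (N, d_c)
    # Spatial attention: one weight per region, normalized over the N regions.
    beta_s = softmax(Bk_emb @ p["ws"])               # (N,)
    # Channel-wise attention: a gate per channel from the mean-pooled embedding.
    beta_c = sigmoid(Bk_emb.mean(axis=0) @ p["We"])  # (d_b,)
    # Bilinear pooling of query and values, then apply both attentions.
    Bv = act(V @ p["Wv"]) * act(q @ p["Wqv"])        # (N, d_b)
    attended = beta_c * (beta_s @ Bv)                # (d_b,)
    return attended, beta_s, beta_c

rng = np.random.default_rng(0)
params = init_params(d_q=8, d_k=16, d_v=16, d_b=12, d_c=6, rng=rng)
q = rng.standard_normal(8)
K = rng.standard_normal((5, 16))
V = rng.standard_normal((5, 16))
attended, beta_s, beta_c = bilinear_attention(q, K, V, params)
```

Passing `act=elu` swaps the activation used inside the bilinear pooling, which is where the abstract places the ELU for modeling infinite-order interactions. Stacking would amount to feeding `attended` back in as the query of a further block, again under the assumptions stated above.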

Code

Panda-Peter/image-captioning official

268

jdai-cv/image-captioning

268

Tasks

Decoder

Fine-Grained Visual Recognition

Image Captioning

Question Answering

Sentence

Visual Question Answering

Visual Question Answering (VQA)

Datasets

MS COCO

COCO Captions

Results from the Paper

Ranked #21 on Image Captioning on COCO Captions

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReLU • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
