Mega: Moving Average Equipped Gated Attention

The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with an (exponential) moving average to incorporate the inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple fixed-length chunks. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
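
The sketch below illustrates, in PyTorch, how the pieces described above could fit together: a learnable damped EMA pre-mixes the input so queries and keys carry position-aware local structure, a single-head softmax attention follows, and an output gate fuses the attention result with the residual input; a chunk_size option mimics the Mega-chunk variant by restricting attention to fixed-length blocks. All names here (SimplifiedMegaLayer, alpha, delta, chunk_size) and the exact gating form are illustrative assumptions rather than the authors' implementation, which uses a multi-dimensional damped EMA computed as a long convolution and GRU/GAU-style gating.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedMegaLayer(nn.Module):
    def __init__(self, dim, chunk_size=None):
        super().__init__()
        self.dim = dim
        self.chunk_size = chunk_size
        # Learnable damped-EMA parameters, one pair per feature dimension
        # (sigmoid keeps both factors in (0, 1)).
        self.alpha = nn.Parameter(torch.zeros(dim))  # smoothing factor
        self.delta = nn.Parameter(torch.zeros(dim))  # damping factor
        # Single-head attention projections: Q/K from the EMA output, V from x.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Output gate that mixes the attention output with the residual input.
        self.gate = nn.Linear(2 * dim, dim)

    def ema(self, x):
        # x: (batch, seq_len, dim). Sequential recurrence for readability;
        # in practice the damped EMA is computed as a convolution (via FFT).
        alpha = torch.sigmoid(self.alpha)
        delta = torch.sigmoid(self.delta)
        h = torch.zeros_like(x[:, 0])
        states = []
        for t in range(x.size(1)):
            h = alpha * x[:, t] + (1.0 - alpha * delta) * h
            states.append(h)
        return torch.stack(states, dim=1)

    def attend(self, q, k, v):
        scores = q @ k.transpose(-2, -1) / (self.dim ** 0.5)
        return F.softmax(scores, dim=-1) @ v

    def forward(self, x):
        x_ema = self.ema(x)                      # position-aware local mixing
        q, k = self.q_proj(x_ema), self.k_proj(x_ema)
        v = self.v_proj(x)

        if self.chunk_size is None:
            attn = self.attend(q, k, v)          # quadratic in seq_len
        else:
            # Mega-chunk style: attend only within fixed-length chunks,
            # giving linear time/space in sequence length.
            b, n, d = x.shape
            c = self.chunk_size
            assert n % c == 0, "pad the sequence to a multiple of chunk_size"
            q, k, v = (t.reshape(b, n // c, c, d) for t in (q, k, v))
            attn = self.attend(q, k, v).reshape(b, n, d)

        # Gated residual combination of the attention output and the input.
        g = torch.sigmoid(self.gate(torch.cat([attn, x], dim=-1)))
        return g * attn + (1.0 - g) * x


# Example: a batch of 2 sequences, length 128, model dim 64, chunk size 32.
layer = SimplifiedMegaLayer(dim=64, chunk_size=32)
y = layer(torch.randn(2, 128, 64))  # -> shape (2, 128, 64)
```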

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | ImageNet | Mega | Top 1 Accuracy | 82.4% | # 491 |
| Image Classification | ImageNet | Mega | Number of params | 90M | # 847 |
| Long-range modeling | LRA | Mega | ListOps | 63.14 | # 1 |
| Long-range modeling | LRA | Mega | Text | 90.43 | # 1 |
| Long-range modeling | LRA | Mega | Retrieval | 91.25 | # 3 |
| Long-range modeling | LRA | Mega | Image | 90.44 | # 1 |
| Long-range modeling | LRA | Mega | Pathfinder | 96.01 | # 3 |
| Long-range modeling | LRA | Mega | Avg | 88.21 | # 1 |
| Long-range modeling | LRA | Mega | Pathfinder-X | 97.98 | # 3 |
| Long-range modeling | LRA | Mega-chunk | ListOps | 58.76 | # 11 |
| Long-range modeling | LRA | Mega-chunk | Text | 90.19 | # 2 |
| Long-range modeling | LRA | Mega-chunk | Retrieval | 90.97 | # 8 |
| Long-range modeling | LRA | Mega-chunk | Image | 85.8 | # 11 |
| Long-range modeling | LRA | Mega-chunk | Pathfinder | 94.41 | # 7 |
| Long-range modeling | LRA | Mega-chunk | Avg | 85.66 | # 8 |
| Long-range modeling | LRA | Mega-chunk | Pathfinder-X | 93.81 | # 8 |
| Language Modelling | WikiText-103 | Mega | Test perplexity | 18.07 | # 30 |
| Language Modelling | WikiText-103 | Mega | Number of params | 252M | # 17 |
| Machine Translation | WMT2014 English-German | Mega | BLEU score | 29.01 | # 34 |
| Machine Translation | WMT2014 English-German | Mega | SacreBLEU | 27.96 | # 7 |
| Machine Translation | WMT2014 English-German | Mega | Number of params | 67M | # 11 |
| Machine Translation | WMT2014 German-English | Mega | BLEU score | 33.12 | # 4 |
