N-Grammer: Augmenting Transformers with latent n-grams

Transformer models have recently emerged as one of the foundational models in natural language processing, and as a byproduct there is significant recent interest and investment in scaling these models. However, the training and inference costs of these large Transformer language models are prohibitive, necessitating more research into identifying more efficient variants. In this work, we propose a simple yet effective modification to the Transformer architecture, inspired by the literature on statistical language modeling: we augment the model with n-grams constructed from a discrete latent representation of the text sequence. We evaluate our model, the N-Grammer, on language modeling with the C4 dataset as well as text classification on the SuperGLUE dataset, and find that it outperforms several strong baselines such as the Transformer and the Primer. We open-source our model in JAX for reproducibility.
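At a high level, the latent n-gram augmentation can be read as: discretize each token representation into a cluster ID, combine consecutive IDs into bigram IDs, look up a trainable embedding for each bigram, and fuse it with the token embedding. The sketch below illustrates this idea in JAX. It is not the authors' released implementation; the nearest-centroid quantizer, the hashing into a fixed bigram vocabulary, and the concatenation used here are illustrative assumptions.

```python
# Minimal sketch of a latent-bigram augmentation layer (illustrative, not the
# released N-Grammer code). Codebook size, bigram vocabulary size, and the
# modulo hash are assumptions made for this example.
import jax
import jax.numpy as jnp


def latent_ngram_augment(x, centroids, ngram_table, ngram_vocab_size):
    """x: [batch, seq, dim] token embeddings
    centroids: [num_clusters, dim] codebook defining the discrete latent IDs
    ngram_table: [ngram_vocab_size, ngram_dim] trainable bigram embeddings."""
    # 1) Discretize: nearest-centroid cluster ID per position.
    dists = jnp.sum((x[..., None, :] - centroids) ** 2, axis=-1)  # [b, s, k]
    ids = jnp.argmin(dists, axis=-1)                              # [b, s]
    # 2) Form bigram IDs from consecutive latent IDs (position 0 pairs with a
    #    padded zero ID), then hash into the bigram vocabulary.
    prev_ids = jnp.pad(ids[:, :-1], ((0, 0), (1, 0)))
    bigram_ids = (ids * centroids.shape[0] + prev_ids) % ngram_vocab_size
    # 3) Look up bigram embeddings and combine with the token embeddings.
    ngram_emb = ngram_table[bigram_ids]                           # [b, s, d_n]
    return jnp.concatenate([x, ngram_emb], axis=-1)


# Toy usage with random parameters.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (2, 8, 16))
centroids = jax.random.normal(key, (32, 16))
ngram_table = jax.random.normal(key, (1024, 8))
y = latent_ngram_augment(x, centroids, ngram_table, 1024)
print(y.shape)  # (2, 8, 24)
```

Because the bigram IDs are built from discrete latents rather than raw token IDs, the extra lookup adds memory but very little compute on top of the Transformer layer it augments.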

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Question Answering | BoolQ | N-Grammer 343M | Accuracy | 65 | # 42 |
| Language Modelling | C4 | N-Grammer 343M | Perplexity | 14.79 | # 7 |
| Language Modelling | C4 | N-Grammer 288M | Perplexity | 15.01 | # 8 |
| Natural Language Inference | CommitmentBank | N-Grammer 343M | F1 | 59.7 | # 8 |
| Natural Language Inference | CommitmentBank | N-Grammer 343M | Accuracy | 67.9 | # 14 |
| Question Answering | COPA | N-Grammer 343M | Accuracy | 60.0 | # 56 |
| Question Answering | MultiRC | N-Grammer 343M | F1 | 62 | # 19 |
| Question Answering | MultiRC | N-Grammer 343M | EM | 11.3 | # 12 |
| Common Sense Reasoning | ReCoRD | N-Grammer 343M | F1 | 29.9 | # 34 |
| Common Sense Reasoning | ReCoRD | N-Grammer 343M | EM | 28.9 | # 35 |
| Natural Language Inference | RTE | N-Grammer 343M | Accuracy | 59.2% | # 73 |
| Coreference Resolution | Winograd Schema Challenge | N-Grammer 343M | Accuracy | 68.3 | # 37 |
| Word Sense Disambiguation | Words in Context | N-Grammer 343M | Accuracy | 56.1 | # 22 |

Methods