FNet: Mixing Tokens with Fourier Transforms

We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
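The core idea described above is small enough to sketch directly. Below is a minimal, illustrative JAX sketch (not the paper's official Flax implementation) of one FNet encoder block: the self-attention sublayer is replaced by an unparameterized 2D Fourier Transform over the sequence and hidden dimensions, keeping only the real part, followed by the standard feed-forward sublayer. Function names, the post-norm layout, and the toy dimensions are assumptions made for this sketch.

```python
import jax.numpy as jnp
from jax import random
from jax.nn import gelu

def fourier_mix(x):
    """Token-mixing sublayer: 2D DFT over the sequence and hidden axes,
    keeping only the real part. Unparameterized, so it adds no weights."""
    return jnp.fft.fft2(x, axes=(-2, -1)).real

def layer_norm(x, eps=1e-6):
    # Parameter-free layer norm for brevity; a real model would learn scale/bias.
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def fnet_encoder_block(x, w1, b1, w2, b2):
    """One encoder block: Fourier mixing in place of self-attention, then the
    usual position-wise feed-forward sublayer, each wrapped in a residual
    connection and layer norm (post-norm, BERT-style)."""
    x = layer_norm(x + fourier_mix(x))
    ff = gelu(x @ w1 + b1) @ w2 + b2
    return layer_norm(x + ff)

# Toy usage: batch of 2 sequences, length 8, hidden size 16, FFN size 64.
key = random.PRNGKey(0)
k1, k2, k3 = random.split(key, 3)
x = random.normal(k1, (2, 8, 16))
w1 = random.normal(k2, (16, 64)) * 0.02
w2 = random.normal(k3, (64, 16)) * 0.02
b1 = jnp.zeros(64)
b2 = jnp.zeros(16)
out = fnet_encoder_block(x, w1, b1, w2, b2)
print(out.shape)  # (2, 8, 16)
```

Because the mixing step has no learned parameters and the FFT runs in O(n log n) in the sequence length, this block trades a small amount of accuracy for the speed and memory savings reported in the abstract.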

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | FNet-Large | Accuracy | 78% | # 9 |
| Semantic Textual Similarity | MRPC | FNet-Large | Accuracy | 88% | # 23 |
| Natural Language Inference | MultiNLI | FNet-Large | Matched | 78 | # 42 |
| Natural Language Inference | MultiNLI | FNet-Large | Mismatched | 76 | # 33 |
| Natural Language Inference | MultiNLI | BERT-Large | Matched | 88 | # 15 |
| Natural Language Inference | MultiNLI | BERT-Large | Mismatched | 88 | # 10 |
| Natural Language Inference | QNLI | FNet-Large | Accuracy | 85% | # 40 |
| Paraphrase Identification | Quora Question Pairs | FNet-Large | F1 | 85 | # 5 |
| Sentiment Analysis | SST-2 Binary classification | FNet-Large | Accuracy | 94 | # 37 |
| Natural Language Inference | RTE | FNet-Large | Accuracy | 69% | # 57 |
| Semantic Textual Similarity | STS Benchmark | FNet-Large | Spearman Correlation | 0.84 | # 28 |
