Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par with, and sometimes outperforming, subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.

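To make the mechanism concrete, below is a minimal PyTorch sketch of a GBST-style layer as described in the abstract: candidate subword blocks are enumerated, scored position-wise by a block scoring network, softly combined, and then downsampled. The class name `GBSTSketch`, the use of mean pooling, the block sizes, and the downsample rate are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a gradient-based subword tokenization (GBST) style module.
# Assumptions (not the paper's exact configuration): mean-pooled candidate
# blocks, a single linear block-scoring network, and mean-pool downsampling.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GBSTSketch(nn.Module):
    def __init__(self, dim: int, block_sizes=(1, 2, 3, 4), downsample_rate: int = 2):
        super().__init__()
        self.block_sizes = block_sizes
        self.downsample_rate = downsample_rate
        # Block scoring network: one scalar score per candidate block.
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) character/byte embeddings.
        b, n, d = x.shape
        candidates, scores = [], []
        for size in self.block_sizes:
            # Pad so the sequence length is divisible by the block size.
            pad = (size - n % size) % size
            xp = F.pad(x, (0, 0, 0, pad))
            # Mean-pool non-overlapping blocks: (batch, n_blocks, dim).
            blocks = xp.view(b, -1, size, d).mean(dim=2)
            block_scores = self.score(blocks)  # (batch, n_blocks, 1)
            # Broadcast each block and its score back to character positions.
            blocks = blocks.repeat_interleave(size, dim=1)[:, :n]
            block_scores = block_scores.repeat_interleave(size, dim=1)[:, :n]
            candidates.append(blocks)
            scores.append(block_scores)
        candidates = torch.stack(candidates, dim=2)            # (batch, n, num_sizes, dim)
        weights = torch.softmax(torch.cat(scores, dim=2), dim=2)  # (batch, n, num_sizes)
        # Position-wise soft selection over candidate block sizes.
        latent = (candidates * weights.unsqueeze(-1)).sum(dim=2)  # (batch, n, dim)
        # Downsample to shorten the sequence fed to the Transformer stack.
        latent = F.avg_pool1d(latent.transpose(1, 2), self.downsample_rate).transpose(1, 2)
        return latent
```

In a Charformer-like setup, such a module would sit in place of a subword embedding layer in front of a standard Transformer encoder; the downsampling step is what shortens the byte-level sequence and underlies the reported speed advantage over vanilla byte-level models.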
| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Linguistic Acceptability | CoLA | Charformer-Tall | Accuracy | 51.8% | # 36 |
| Semantic Textual Similarity | MRPC | Charformer-Tall | Accuracy | 87.5% | # 25 |
| Semantic Textual Similarity | MRPC | Charformer-Tall | F1 | 91.4 | # 7 |
| Natural Language Inference | MultiNLI | Charformer-Tall | Matched | 83.7 | # 33 |
| Natural Language Inference | MultiNLI | Charformer-Tall | Mismatched | 84.4 | # 20 |
| Natural Language Inference | QNLI | Charformer-Tall | Accuracy | 91.0% | # 31 |
| Paraphrase Identification | Quora Question Pairs | Charformer-Tall | Accuracy | 91.4 | # 2 |
| Paraphrase Identification | Quora Question Pairs | Charformer-Tall | F1 | 88.5 | # 3 |
| Sentiment Analysis | SST-2 Binary classification | Charformer-Base | Accuracy | 91.6 | # 50 |
| Semantic Textual Similarity | STS Benchmark | Charformer-Tall | Pearson Correlation | 0.873 | # 24 |
