ByT5: Towards a token-free future with pre-trained byte-to-byte models

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
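To make the "token-free" idea concrete, below is a minimal sketch of byte-level encoding in the spirit ByT5 describes: the UTF-8 bytes of the raw text are used directly as token ids, with a handful of low ids reserved for special tokens. The specific offset and special-token ids here are illustrative assumptions, not necessarily the exact values used by the released checkpoints.

```python
# Byte-level "tokenization" sketch: no vocabulary, no preprocessing pipeline.
# Assumption: a small offset reserves the lowest ids for special tokens
# (padding, end-of-sequence, unknown); actual values may differ in the release.

SPECIAL_TOKENS = {"<pad>": 0, "</s>": 1, "<unk>": 2}
OFFSET = len(SPECIAL_TOKENS)  # byte value b maps to token id b + OFFSET

def encode(text: str) -> list[int]:
    """Map a string to byte-level token ids; works for any language or script."""
    return [b + OFFSET for b in text.encode("utf-8")] + [SPECIAL_TOKENS["</s>"]]

def decode(ids: list[int]) -> str:
    """Invert encode(), dropping special tokens and ignoring malformed byte runs."""
    raw = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return raw.decode("utf-8", errors="ignore")

print(encode("héllo"))          # note the multi-byte UTF-8 sequence for "é"
print(decode(encode("héllo")))  # round-trips back to the original string
```

The trade-off the paper characterizes follows directly from this sketch: byte sequences are several times longer than subword sequences for the same text, so the model must amortize that extra length, but the preprocessing step shrinks to a single `encode("utf-8")` call.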

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Extreme Summarization | GEM-XSum | mT5 | BLEU score | 14.3 | #2 |
| Extreme Summarization | GEM-XSum | ByT5 | BLEU score | 15.3 | #1 |
| Cross-Lingual Question Answering | MLQA | ByT5 XXL | EM | 54.9 | #1 |
| Cross-Lingual Question Answering | MLQA | ByT5 XXL | F1 | 71.6 | #1 |
| Cross-Lingual Paraphrase Identification | PAWS-X | ByT5 XXL | Accuracy | 90.1 | #1 |
| Cross-Lingual Paraphrase Identification | PAWS-X | ByT5 Small | Accuracy | 84 | #4 |
| Question Answering | TweetQA | mT5 | BLEU-1 | 70.8 | #2 |
| Question Answering | TweetQA | mT5 | ROUGE-L | 74.3 | #2 |
| Question Answering | TweetQA | ByT5 (small) | BLEU-1 | 72.0 | #1 |
| Question Answering | TweetQA | ByT5 | ROUGE-L | 75.7 | #1 |
| Cross-Lingual Question Answering | TyDiQA-GoldP | ByT5 (fine-tuned) | EM | 81.9 | #1 |
| Cross-Lingual Question Answering | TyDiQA-GoldP | ByT5 XXL | EM | 60.0 | #5 |
| Cross-Lingual Question Answering | TyDiQA-GoldP | ByT5 XXL | F1 | 75.3 | #2 |
| Cross-Lingual NER | WikiAnn NER | ByT5 XXL | F1 | 67.7 | #1 |
| Cross-Lingual Natural Language Inference | XNLI | ByT5 XXL | Accuracy | 83.7 | #1 |
| Cross-Lingual Natural Language Inference | XNLI | ByT5 Small | Accuracy | 69.1 | #4 |
| Cross-Lingual Question Answering | XQuAD | ByT5 XXL | EM | 63.6 | #1 |
| Cross-Lingual Question Answering | XQuAD | ByT5 XXL | F1 | 79.7 | #1 |
