Investigating the Effectiveness of BPE: The Power of Shorter Sequences

IJCNLP 2019 · Matthias Gallé

Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique, commonly used in neural machine translation and other NLP tasks. Its effectiveness makes it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary sizes show that, given a fixed vocabulary size budget, the fewer tokens an algorithm needs to cover the test set, the better the translation (as measured by BLEU).
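The paper itself does not include code; as a rough illustration of the algorithm under discussion, below is a minimal sketch of the standard BPE merge-learning loop (in the spirit of Sennrich et al.'s reference implementation): repeatedly find the most frequent adjacent symbol pair in the training corpus and merge it into a single symbol. Function names and the toy corpus are illustrative, not from the paper.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    # Match the pair only at symbol boundaries, not inside longer symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    replacement = "".join(pair)
    merged = Counter()
    for word, freq in vocab.items():
        merged[pattern.sub(replacement, word)] += freq
    return merged

def learn_bpe(corpus, num_merges):
    """Greedily learn up to `num_merges` merge operations from raw text."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(" ".join(word) + " </w>" for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

if __name__ == "__main__":
    print(learn_bpe("low low lower lowest newer newest", 6))
```

Each merge adds one entry to the vocabulary, so the number of merges is effectively the vocabulary size budget the abstract refers to; the paper's comparison swaps this greedy pair-merging criterion for other dictionary-based compression strategies under the same budget.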
