Byte Pair Encoding

Introduced by Sennrich et al. in Neural Machine Translation of Rare Words with Subword Units

Byte Pair Encoding, or BPE, is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units. The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).

Lei Mao has a detailed blog post that explains how this works.

Source: Neural Machine Translation of Rare Words with Subword Units

Read Paper See Code

Papers

Paper	Code	Results	Date	Stars

Tasks

Task	Papers	Share
Language Modelling	59	7.75%
Retrieval	35	4.60%
Question Answering	29	3.81%
Large Language Model	28	3.68%
Semantic Segmentation	22	2.89%
In-Context Learning	15	1.97%
Object Detection	14	1.84%
Sentence	12	1.58%
Denoising	11	1.45%

Usage Over Time

This feature is experimental; we are continuously improving our matching algorithm.

Components

Component	Type	Add Remove
🤖 No Components Found	You can add them if they exist; e.g. Mask R-CNN uses RoIAlign

Categories

Add Remove

Subword Segmentation