Subword Segmentation

Byte Pair Encoding

Introduced by Sennrich et al. in Neural Machine Translation of Rare Words with Subword Units

Byte Pair Encoding, or BPE, is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units. The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).

Lei Mao has a detailed blog post that explains how this works.

Source: Neural Machine Translation of Rare Words with Subword Units

Papers


Paper Code Results Date Stars

Tasks


Task Papers Share
Language Modelling 59 7.75%
Retrieval 35 4.60%
Question Answering 29 3.81%
Large Language Model 28 3.68%
Semantic Segmentation 22 2.89%
In-Context Learning 15 1.97%
Object Detection 14 1.84%
Sentence 12 1.58%
Denoising 11 1.45%

Components


Component Type
🤖 No Components Found You can add them if they exist; e.g. Mask R-CNN uses RoIAlign

Categories