Tokenizers

SentencePiece is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting both the byte-pair-encoding (BPE) algorithm and the unigram language model, and converts the text into an ID sequence, guaranteeing perfect reproducibility of the normalization and subword segmentation.

Source: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
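To make the BPE side of this concrete, below is a minimal, self-contained sketch of the classic byte-pair-encoding merge-learning loop: start from single characters, repeatedly count adjacent symbol pairs across the corpus, and merge the most frequent pair. This is a toy illustration of the algorithm, not the SentencePiece implementation (which additionally handles normalization, whitespace as a meta symbol, and the unigram model); all names here are illustrative.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` in the symbol tuple
    # with the concatenated new symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy sketch)."""
    # Each word becomes a tuple of characters plus an end-of-word marker,
    # weighted by its corpus frequency.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

# Toy corpus: "low" x5, "lower" x2, "newest" x6, "widest" x3.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, 3)
# First merges learned: ('e','s'), then ('es','t'), then ('est','</w>'),
# so frequent suffixes like "est</w>" become single subword units.
```

At segmentation time, the learned merges are replayed in order on new text, so the same input always yields the same subword sequence, which is the reproducibility property the description above refers to.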


Tasks


Task Papers Share
Language Modelling 91 8.13%
Language Modeling 74 6.61%
Question Answering 47 4.20%
Sentence 44 3.93%
Decoder 41 3.66%
Text Generation 32 2.86%
Translation 31 2.77%
Machine Translation 29 2.59%
Retrieval 26 2.32%

Components


Component Type
BPE
Subword Segmentation

Categories