Tokenizers

SentencePiece is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting both the byte-pair-encoding (BPE) algorithm and the unigram language model, and converts the text into an ID sequence, guaranteeing perfect reproducibility of the normalization and subword segmentation.

Source: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Tasks


Task                            Papers  Share
Language Modelling                  96  9.29%
Question Answering                  58  5.61%
Sentence                            47  4.55%
Decoder                             44  4.26%
Text Generation                     39  3.78%
Translation                         31  3.00%
Retrieval                           30  2.90%
Machine Translation                 28  2.71%
Natural Language Understanding      21  2.03%

Components


Component  Type
BPE        Subword Segmentation

Categories