SentencePiece is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting both the byte-pair-encoding (BPE) algorithm and the unigram language model, and then converts the text into an id sequence, guaranteeing perfect reproducibility of the normalization and subword segmentation.
Source: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
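
To make the description above concrete, here is a minimal pure-Python sketch of BPE-style subword segmentation followed by conversion to an id sequence and back. It is a toy illustration under simplifying assumptions, not SentencePiece's implementation or API: the function names (`merge_pair`, `train_bpe`, `encode`, `decode`) are hypothetical, normalization and the unigram language model are omitted, and whitespace is simply treated as an ordinary character so that decoding is a plain join.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges and a piece-to-id vocabulary (toy version)."""
    sequences = [list(text) for text in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically so that
        # training is deterministic.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    chars = {ch for text in corpus for ch in text}
    pieces = sorted(chars | {a + b for a, b in merges})
    vocab = {piece: idx for idx, piece in enumerate(pieces)}
    return merges, vocab


def encode(text, merges, vocab):
    """Segment `text` by replaying the merges in order, then map pieces to ids.

    Raises KeyError for characters never seen in training (no unknown-token
    handling in this sketch).
    """
    symbols = list(text)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return [vocab[s] for s in symbols]


def decode(ids, vocab):
    """Map ids back to pieces and join them to recover the original text."""
    id_to_piece = {idx: piece for piece, idx in vocab.items()}
    return "".join(id_to_piece[i] for i in ids)


# Hypothetical toy corpus, chosen only for illustration.
corpus = ["low lower lowest", "new newer newest"]
merges, vocab = train_bpe(corpus, num_merges=10)
ids = encode("lower newest", merges, vocab)
restored = decode(ids, vocab)
```

Because the learned merges are replayed in a fixed order and every piece has exactly one id, encoding the same text always yields the same id sequence, and decoding returns the input string unchanged. That is the kind of reproducibility the description refers to, shown here only for this simplified setting.
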
Task | Papers | Share |
---|---|---|
Language Modelling | 91 | 8.13% |
Language Modeling | 74 | 6.61% |
Question Answering | 47 | 4.20% |
Sentence | 44 | 3.93% |
Decoder | 41 | 3.66% |
Text Generation | 32 | 2.86% |
Translation | 31 | 2.77% |
Machine Translation | 29 | 2.59% |
Retrieval | 26 | 2.32% |