SentencePiece is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting the byte-pair-encoding (BPE) algorithm and the unigram language model, and converts raw text directly into an id sequence, which guarantees perfect reproducibility of the normalization and subword segmentation.
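A minimal sketch of this text-to-ids flow using the sentencepiece Python package is shown below; the corpus file, vocabulary size, and model prefix are placeholder assumptions, not values taken from the paper.

```python
import sentencepiece as spm

# Train a subword model directly from raw text (no pre-tokenization required).
# "corpus.txt", the vocab size, and the model prefix are illustrative assumptions.
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # one sentence per line, raw text
    model_prefix="spm_demo",  # writes spm_demo.model and spm_demo.vocab
    vocab_size=8000,
    model_type="unigram",     # or "bpe" for byte-pair encoding
)

# Load the trained model and convert text to id sequences and back.
sp = spm.SentencePieceProcessor(model_file="spm_demo.model")

ids = sp.encode("This is a test.", out_type=int)     # vocabulary ids
pieces = sp.encode("This is a test.", out_type=str)  # subword pieces

# Decoding reverses the segmentation; under the default normalization this
# round-trips back to the original text.
print(sp.decode(ids))
```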
Source: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
| Task | Papers | Share |
|---|---|---|
| Language Modelling | 89 | 9.71% |
| Question Answering | 69 | 7.52% |
| Text Generation | 45 | 4.91% |
| Machine Translation | 30 | 3.27% |
| Natural Language Understanding | 27 | 2.94% |
| Sentiment Analysis | 20 | 2.18% |
| Reading Comprehension | 19 | 2.07% |
| Semantic Parsing | 17 | 1.85% |
| Abstractive Text Summarization | 17 | 1.85% |