SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

EMNLP 2018. Taku Kudo, John Richardson.

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, including neural machine translation. It provides open-source C++ and Python implementations for subword units...
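The core idea behind a language-independent tokenizer/detokenizer pair is lossless segmentation: whitespace is encoded as a visible meta symbol ("▁", U+2581) so that detokenization is a pure string operation with no language-specific rules. The following is a minimal conceptual sketch of that idea, not the SentencePiece API; the vocabulary is hand-picked for illustration, whereas a real model learns its pieces from data (e.g. via BPE or a unigram language model).

```python
# Minimal sketch of lossless subword tokenization in the spirit of
# SentencePiece: spaces become the meta symbol "▁" so the original
# text can be recovered exactly by concatenation. The VOCAB below is
# illustrative only; SentencePiece learns its vocabulary from a corpus.

VOCAB = {"▁Hello", "▁wor", "ld", "▁", "H", "e", "l", "o", "w", "r", "d", "."}

def tokenize(text):
    """Greedy longest-match segmentation over the raw string."""
    s = "▁" + text.replace(" ", "▁")   # mark word boundaries explicitly
    pieces, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest piece first
            if s[i:j] in VOCAB:
                pieces.append(s[i:j])
                i = j
                break
        else:
            pieces.append(s[i])          # fall back to a single character
            i += 1
    return pieces

def detokenize(pieces):
    """Lossless inverse: concatenate pieces and restore spaces."""
    return "".join(pieces).replace("▁", " ").lstrip()

text = "Hello world."
pieces = tokenize(text)                  # ["▁Hello", "▁wor", "ld", "."]
assert detokenize(pieces) == text        # round-trip is exact
```

Because the inverse mapping is a trivial string replacement, the same code path works for languages with and without whitespace, which is the property the paper's "detokenizer" half provides.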

