Tokenization

28 papers with code · Natural Language Processing

State-of-the-art leaderboards

No evaluation results yet. Help compare methods by submitting evaluation metrics.

Greatest papers with code

BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

LREC 2018 bheinzerling/bpemb

We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE).

ENTITY TYPING TOKENIZATION WORD EMBEDDINGS
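
For readers unfamiliar with Byte-Pair Encoding, the BPEmb entry above rests on a simple idea: repeatedly merge the most frequent adjacent symbol pair in a corpus. The following is a minimal toy sketch of that idea, not BPEmb's actual training pipeline; the function names (`learn_bpe`, `merge_pair`, `segment`) and the tiny word list are illustrative assumptions.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Each word starts as a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
    print(segment("lowly", merges))
```

Because the merge rules are learned from frequency alone, an unseen word such as "lowly" still decomposes into known subword units, which is what lets subword embeddings cover rare and out-of-vocabulary words.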

NLP-Cube: End-to-End Raw Text Processing With Neural Networks

CONLL 2018 adobe/NLP-Cube

We introduce NLP-Cube: an end-to-end Natural Language Processing framework, evaluated in CoNLL's "Multilingual Parsing from Raw Text to Universal Dependencies 2018" Shared Task.

LEMMATIZATION TOKENIZATION

A Call for Clarity in Reporting BLEU Scores

WS 2018 mjpost/sacreBLEU

The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric.

MACHINE TRANSLATION TOKENIZATION

Juman++: A Morphological Analysis Toolkit for Scriptio Continua

EMNLP 2018 ku-nlp/jumanpp

We present a three-part toolkit for developing morphological analyzers for languages without natural word boundaries.

ART ANALYSIS LANGUAGE MODELLING MORPHOLOGICAL ANALYSIS PART-OF-SPEECH TAGGING TOKENIZATION