Search Results for author: Omri Uzan

Found 3 papers, 2 papers with code

Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

1 code implementation • 20 Apr 2024 • Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalizations compared to morphological tokenization for semantic compositionality of word meanings.

Text Classification

Greed is All You Need: An Evaluation of Tokenizer Inference Methods

1 code implementation • 2 Mar 2024 • Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter

While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed.

Tokenization Is More Than Compression

no code implementations • 28 Feb 2024 • Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging raw text and language models.

Data Compression