Tokenization

92 papers with code • 1 benchmark • 7 datasets

Splitting a string into parts, i.e., tokens.
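As a minimal illustration of this definition, the following sketch splits a string into word and punctuation tokens using a simple regular expression (the pattern and function name are illustrative, not taken from any of the papers listed below):

```python
import re

def tokenize(text: str) -> list[str]:
    # Match either a run of word characters or a single
    # non-space, non-word character (punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Splitting a string into parts, i.e., tokens."))
# ['Splitting', 'a', 'string', 'into', 'parts', ',', 'i', '.', 'e', '.', ',', 'tokens', '.']
```

Real systems go well beyond this: the papers below learn subword vocabularies or drop the explicit tokenization step entirely.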

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

11 Mar 2021

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step.

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

23 Jun 2021

In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.

Perceiver IO: A General Architecture for Structured Inputs & Outputs

30 Jul 2021

The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size.

Ranked #1 on Optical Flow Estimation on KITTI 2015 (Average End-Point Error metric)

SimMIM: A Simple Framework for Masked Image Modeling

18 Nov 2021

We also leverage this approach to facilitate the training of a 3B model (SwinV2-G); using $40\times$ less data than in previous practice, it achieves state-of-the-art results on four representative vision benchmarks.

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; and 2) an efficient backbone with a deep-narrow structure for the vision transformer, motivated by CNN architecture design and empirical study.

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages.

Joint CTC/attention decoding for end-to-end speech recognition

End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as pronunciation dictionaries, tokenization, and context-dependency trees, leading to a greatly simplified model-building process.

The RWTH Aachen University Supervised Machine Translation Systems for WMT 2018

In total we improve by 6.8% BLEU over our last year's submission and by 4.8% BLEU over the winning system of the 2017 German→English task.

BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE).
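The Byte-Pair Encoding scheme underlying BPEmb can be sketched in a few lines: starting from character-level symbols, it repeatedly merges the most frequent adjacent symbol pair. This is a minimal illustration of the general BPE algorithm, not BPEmb's actual training code, and the function name and inputs are hypothetical:

```python
from collections import Counter

def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Represent each word as a tuple of symbols (characters to start).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["low", "low", "lower"], 2))
```

Because merges are learned purely from symbol co-occurrence statistics, the same procedure applies unchanged to any of the 275 languages BPEmb covers.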
