Tokenization

92 papers with code • 1 benchmark • 7 datasets

Splitting a string into parts, i.e., tokens.
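As a minimal illustration of the task, the sketch below splits a string on whitespace in plain Python. Production systems typically use subword schemes such as Byte-Pair Encoding (see BPEmb further down this list).

```python
# Minimal whitespace tokenization in plain Python (illustrative only;
# real pipelines typically use subword tokenizers such as BPE).
text = "Splitting a string into parts, i.e., tokens."
tokens = text.split()
print(tokens)
# ['Splitting', 'a', 'string', 'into', 'parts,', 'i.e.,', 'tokens.']
```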

Greatest papers with code

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

huggingface/transformers 11 Mar 2021

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step.

Tokenization
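A minimal usage sketch of the tokenization-free idea, assuming the CANINE classes shipped in huggingface/transformers and the "google/canine-c" checkpoint on the Hugging Face Hub; the example is illustrative, not the paper's training setup.

```python
# Minimal usage sketch, assuming the CANINE classes in huggingface/transformers
# and the "google/canine-c" checkpoint on the Hugging Face Hub.
from transformers import CanineTokenizer, CanineModel

tokenizer = CanineTokenizer.from_pretrained("google/canine-c")
model = CanineModel.from_pretrained("google/canine-c")

# There is no subword vocabulary: each character is mapped to its Unicode code point.
inputs = tokenizer("Tokenization-free encoding.", return_tensors="pt")
outputs = model(**inputs)
print(inputs["input_ids"][0][:8])        # code points (plus special markers)
print(outputs.last_hidden_state.shape)   # one hidden state per input character
```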

Perceiver IO: A General Architecture for Structured Inputs & Outputs

deepmind/deepmind-research 30 Jul 2021

The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size.

 Ranked #1 on Optical Flow Estimation on KITTI 2015 (Average End-Point Error metric)

Optical Flow Estimation Starcraft +2

SimMIM: A Simple Framework for Masked Image Modeling

lucidrains/vit-pytorch 18 Nov 2021

We also leverage this approach to facilitate the training of a 3B model (SwinV2-G), which achieves state-of-the-art results on four representative vision benchmarks while using $40\times$ less data than previous practice.

Fine-tuning Representation Learning +1

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

lucidrains/vit-pytorch ICCV 2021

To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, so that local structure represented by surrounding tokens can be modeled and the token length can be reduced; and 2) an efficient backbone with a deep-narrow structure for the vision transformer, motivated by CNN architecture design after an empirical study.

Image Classification Language Modelling +1
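The sketch below illustrates the token re-aggregation idea in PyTorch, using `nn.Unfold` as a stand-in for one T2T "soft split" step; shapes and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of one Tokens-to-Token (T2T) step, assuming PyTorch.
# Tokens are re-folded into an image grid, then each new token aggregates
# a kernel x kernel neighbourhood of old tokens, reducing the token length.
import torch
import torch.nn as nn

def t2t_step(tokens, h, w, kernel=3, stride=2, padding=1):
    b, n, c = tokens.shape                  # tokens: (batch, h*w, channels)
    assert n == h * w
    feat = tokens.transpose(1, 2).reshape(b, c, h, w)
    unfold = nn.Unfold(kernel_size=kernel, stride=stride, padding=padding)
    out = unfold(feat)                      # (b, c*kernel*kernel, n_new)
    return out.transpose(1, 2)              # (b, n_new, c*kernel*kernel)

x = torch.randn(2, 56 * 56, 64)             # 2 images, 56x56 token grid, 64 channels
y = t2t_step(x, 56, 56)
print(y.shape)                              # torch.Size([2, 784, 576]): fewer, wider tokens
```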

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

stanfordnlp/stanza ACL 2020

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages.

Coreference Resolution Dependency Parsing +4
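A minimal usage sketch of Stanza's tokenizer, assuming the `stanza` package is installed and its English models have been downloaded.

```python
# Minimal usage sketch of Stanza's tokenizer (requires `pip install stanza`
# and a one-time model download).
import stanza

stanza.download("en")                                    # fetch English models once
nlp = stanza.Pipeline(lang="en", processors="tokenize")  # tokenization-only pipeline
doc = nlp("Stanza supports 66 human languages. Tokenization is the first step.")

for sentence in doc.sentences:
    print([token.text for token in sentence.tokens])
```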

Joint CTC/attention decoding for end-to-end speech recognition

PaddlePaddle/PaddleSpeech ACL 2017

End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as a pronunciation dictionary, tokenization, and context-dependency trees, leading to a greatly simplified model-building process.

Automatic Speech Recognition End-To-End Speech Recognition +2
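A schematic sketch of the joint decoding idea: partial hypotheses in beam search are ranked by an interpolation of CTC and attention log-probabilities. The scoring functions below are hypothetical placeholders for outputs of a trained model, and the weight is a tunable assumption.

```python
# Schematic sketch of joint CTC/attention scoring during beam search.
# `ctc_log_prob` and `att_log_prob` are hypothetical placeholders for
# log-probability scorers produced by a trained model.
def joint_score(hypothesis, ctc_log_prob, att_log_prob, ctc_weight=0.3):
    # Interpolate the two log-probabilities for a partial hypothesis.
    return (ctc_weight * ctc_log_prob(hypothesis)
            + (1.0 - ctc_weight) * att_log_prob(hypothesis))
```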

The RWTH Aachen University Supervised Machine Translation Systems for WMT 2018

awslabs/sockeye WS 2018

In total we improve by 6.8% BLEU over our last year's submission and by 4.8% BLEU over the winning system of the 2017 German→English task.

Fine-tuning Machine Translation +2

BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

bheinzerling/bpemb LREC 2018

We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE).

Entity Typing Tokenization +1
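A minimal usage sketch, assuming the `bpemb` Python package; the vocabulary size and embedding dimension below are arbitrary choices, not recommendations from the paper.

```python
# Minimal usage sketch, assuming the `bpemb` package (pip install bpemb).
# Vocabulary size (vs) and embedding dimension (dim) are arbitrary choices.
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", vs=25000, dim=100)  # downloads the pretrained model once
print(bpemb_en.encode("Tokenization-free subword embeddings"))  # BPE subword pieces
print(bpemb_en.embed("Tokenization").shape)                     # one vector per subword
```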