WordPiece

Introduced by Wu et al. in Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

WordPiece is a subword segmentation algorithm used in natural language processing. The vocabulary is initialized with the individual characters of the language; then, iteratively, the combination of two existing symbols that most increases the likelihood of the training data is added to the vocabulary. The process is:

1. Initialize the word unit inventory with all the characters in the text.
2. Build a language model on the training data using the inventory from 1.
3. Generate a new word unit by combining two units out of the current word inventory to increment the word unit inventory by one. Choose the new word unit out of all the possible ones that increases the likelihood on the training data the most when added to the model.
4. Goto 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.
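
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the paper: the "language model" is assumed to be a plain unigram model over the current units, every candidate pair is scored by brute force (re-segmenting the corpus and recomputing the log-likelihood), and word-boundary markers are omitted. The names `merge_pair`, `log_likelihood`, `train_wordpiece`, `max_units`, and `min_gain` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def merge_pair(words, pair):
    """Return a copy of `words` with every adjacent occurrence of `pair` fused."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def log_likelihood(words):
    """Log-likelihood of the segmented corpus under a unigram model (assumption)."""
    counts = Counter()
    for symbols, freq in words.items():
        for s in symbols:
            counts[s] += freq
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())


def train_wordpiece(word_freqs, max_units, min_gain=0.0):
    # Step 1: start from single characters.
    words = {tuple(w): f for w, f in word_freqs.items()}
    inventory = {s for symbols in words for s in symbols}

    # Step 4: stop at the unit limit or when no merge gains more than min_gain.
    while len(inventory) < max_units:
        # Step 2: (re)build the model on the current segmentation.
        base = log_likelihood(words)

        # Step 3: try every adjacent pair and keep the one with the largest gain.
        candidates = {(s[i], s[i + 1]) for s in words for i in range(len(s) - 1)}
        best_gain, best_pair, best_words = min_gain, None, None
        for pair in sorted(candidates):  # sorted only to make ties deterministic
            trial = merge_pair(words, pair)
            gain = log_likelihood(trial) - base
            if gain > best_gain:
                best_gain, best_pair, best_words = gain, pair, trial

        if best_pair is None:
            break
        words = best_words
        inventory.add(best_pair[0] + best_pair[1])

    return inventory, words


if __name__ == "__main__":
    # Toy word-frequency table, invented for illustration only.
    toy = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
    units, segmented = train_wordpiece(toy, max_units=10)
    print(sorted(units))
    print(segmented)
```

The brute-force scoring costs one pass over the corpus per candidate pair, which is only practical for toy inputs; it is used here because it follows the description step by step rather than relying on a closed-form approximation of the likelihood gain.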

Image: WordPiece as used in BERT

Tasks

Task	Papers	Share
Retrieval	120	12.99%
Language Modelling	100	10.82%
Question Answering	63	6.82%
Large Language Model	39	4.22%
Sentence	30	3.25%
Text Classification	28	3.03%
Sentiment Analysis	26	2.81%
Text Generation	23	2.49%
Information Retrieval	22	2.38%

Categories

Subword Segmentation

Tokenizers