Information-theoretic Vocabularization via Optimal Transport

1 Jan 2021 · Jingjing Xu, Hao Zhou, Chun Gan, Zaixiang Zheng, Lei Li

It is well accepted that the choice of token vocabulary largely affects performance in NLP tasks. One dominant approach to constructing a good vocabulary is Byte Pair Encoding (BPE). However, due to the high cost of trial training, prior research has rarely searched for the best token dictionary and its size, beyond simple trials of BPE at commonly used vocabulary sizes (e.g., 30K). In this paper, we find an exciting relation between an information-theoretic feature and the performance of NLP tasks such as machine translation under a given vocabulary. With this observation, we formulate the quest of vocabularization -- finding the best token dictionary with a proper size -- as an optimal transport problem. We then propose info-VOT, a simple and efficient solution that avoids full and costly trial training on the downstream task. We evaluate our approach on multiple machine translation tasks, including WMT-14 English-German translation, WMT-16 English-Romanian translation, and TED translation. Empirical results show that our approach outperforms widely used vocabulary construction methods while using only a fifth as many tokens.
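The abstract does not spell out the information-theoretic criterion or the optimal transport formulation. As a purely illustrative sketch, the Python below ranks candidate BPE vocabularies of different sizes by the marginal gain of a simple entropy-per-character score, without any downstream trial training. The helper names (`corpus_entropy_per_char`, `pick_vocab_size`) and the specific score are assumptions for illustration, not the paper's exact method.

```python
from collections import Counter
import math

def corpus_entropy_per_char(token_counts: Counter) -> float:
    """Shannon entropy of the token distribution, normalized by the
    average token length in characters (a stand-in information score)."""
    total = sum(token_counts.values())
    entropy = -sum((c / total) * math.log(c / total)
                   for c in token_counts.values())
    avg_len = sum(len(tok) * c for tok, c in token_counts.items()) / total
    return entropy / avg_len

def pick_vocab_size(candidate_tokenizations: dict) -> int:
    """candidate_tokenizations maps a vocabulary size to a Counter of
    tokens obtained by segmenting the corpus with that vocabulary
    (e.g. BPE at different merge counts). Returns the size whose extra
    vocabulary slots buy the largest marginal gain in the score."""
    sizes = sorted(candidate_tokenizations)
    scores = {k: corpus_entropy_per_char(candidate_tokenizations[k])
              for k in sizes}
    best_size, best_gain = sizes[0], float("-inf")
    for prev, cur in zip(sizes, sizes[1:]):
        gain = (scores[cur] - scores[prev]) / (cur - prev)  # marginal utility
        if gain > best_gain:
            best_size, best_gain = cur, gain
    return best_size

# Usage sketch: tokenize the same corpus with BPE vocabularies of a few
# candidate sizes, then let the score decide which size to keep.
candidates = {
    1000: Counter({"th": 500, "e": 400, "cat": 50}),
    4000: Counter({"the": 450, "cat": 60, "sat": 40}),
    8000: Counter({"the": 450, "catsat": 30, "on": 70}),
}
print(pick_vocab_size(candidates))
```

The point of the sketch is only that vocabulary size can be chosen by scoring candidates directly on the corpus rather than by training a model per candidate; the paper's actual solution casts this search as an optimal transport problem.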
