Learning Word Vectors for 157 Languages

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
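The pre-trained vectors described in the abstract are distributed in plain-text (.vec) and binary (.bin) formats on fasttext.cc. Below is a minimal sketch of how one might load a set of the published vectors and answer a word analogy query of the kind used in the evaluation datasets, using gensim; the file name cc.fr.300.vec, the vocabulary limit, and the analogy triple are illustrative assumptions rather than details taken from the paper.

```python
# Sketch: load the published French vectors and run a 3CosAdd word analogy query.
from gensim.models import KeyedVectors

# Load the text-format vectors (large file; limit keeps only the most frequent words).
vectors = KeyedVectors.load_word2vec_format("cc.fr.300.vec", binary=False, limit=500_000)

# 3CosAdd analogy: "paris" is to "france" as ? is to "italie",
# i.e. the word maximizing cos(w, paris - france + italie).
for word, score in vectors.most_similar(positive=["paris", "italie"],
                                        negative=["france"], topn=5):
    print(f"{word}\t{score:.3f}")
```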

Published at LREC 2018.

Results from the Paper


Task: Only Connect Walls Dataset Task 1 (Grouping); Dataset: OCW

| Model | Metric Name | Metric Value | Global Rank |
|-------|-------------|--------------|-------------|
| FastText (Crawl) | Wasserstein Distance (WD) | 84.2 ± .5 | # 12 |
| FastText (Crawl) | # Correct Groups | 80 ± 4 | # 13 |
| FastText (Crawl) | Fowlkes Mallows Score (FMS) | 32.1 ± .3 | # 13 |
| FastText (Crawl) | Adjusted Rand Index (ARI) | 15.2 ± .3 | # 13 |
| FastText (Crawl) | Adjusted Mutual Information (AMI) | 18.4 ± .4 | # 13 |
| FastText (Crawl) | # Solved Walls | 0 ± 0 | # 10 |
| FastText (News) | Wasserstein Distance (WD) | 85.5 ± .5 | # 15 |
| FastText (News) | # Correct Groups | 62 ± 3 | # 16 |
| FastText (News) | Fowlkes Mallows Score (FMS) | 30.4 ± .2 | # 15 |
| FastText (News) | Adjusted Rand Index (ARI) | 13.0 ± .2 | # 15 |
| FastText (News) | Adjusted Mutual Information (AMI) | 15.8 ± .3 | # 15 |
| FastText (News) | # Solved Walls | 0 ± 0 | # 10 |
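The grouping metrics in the table (ARI, AMI, FMS) are standard clustering-agreement scores, so a fastText baseline for this task can be sketched as: embed the 16 clues of a wall, cluster them into 4 groups, and compare the predicted grouping to the gold one. The snippet below is only a sketch under stated assumptions: it uses the English Common Crawl vectors (cc.en.300.bin), an invented example wall, and plain k-means, which, unlike the actual task, does not force groups of exactly four clues; the official OCW evaluation protocol may differ.

```python
# Minimal sketch (not the official OCW evaluation): embed clues with fastText,
# cluster into 4 groups, and score against the gold grouping.
import numpy as np
import fasttext
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             adjusted_mutual_info_score,
                             fowlkes_mallows_score)

model = fasttext.load_model("cc.en.300.bin")    # pre-trained Common Crawl vectors

clues = ["bass", "pike", "perch", "carp",       # fish
         "rock", "jazz", "folk", "soul",        # music genres
         "hammer", "anvil", "stirrup", "drum",  # parts of the ear
         "red", "dead", "black", "coral"]       # ___ Sea
gold = np.repeat(np.arange(4), 4)               # gold group index for each clue

embeddings = np.stack([model.get_word_vector(w) for w in clues])
pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

print("ARI:", adjusted_rand_score(gold, pred))
print("AMI:", adjusted_mutual_info_score(gold, pred))
print("FMS:", fowlkes_mallows_score(gold, pred))
```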
