Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web-crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (4 GB) leads to results that are as good as those obtained using larger datasets (130+ GB). Our best performing model, CamemBERT, reaches or improves the state of the art in all four downstream tasks.
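As a quick illustration of the pretrained masked language model the abstract describes, the sketch below queries it through the Hugging Face `transformers` library, which hosts the publicly released `camembert-base` checkpoint. This is a minimal usage example, not the paper's own training or evaluation code.

```python
# Minimal sketch: querying the pretrained CamemBERT masked language model
# via the Hugging Face `transformers` pipeline API (assumes the publicly
# released `camembert-base` checkpoint; not the paper's own code).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="camembert-base")

# CamemBERT is a RoBERTa-style model, so masked positions use the <mask> token.
for prediction in fill_mask("Le camembert est <mask> :)"):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```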


Datasets


Introduced in the Paper:

French Wikipedia

Used in the Paper:

MultiNLI, XNLI, CCNet
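For reference, the French split of XNLI (used for the NLI evaluation below) is available through the Hugging Face `datasets` library; the sketch below assumes that package and the Hub-hosted `xnli` dataset, and is not the paper's own data pipeline.

```python
# Hedged sketch: loading the French XNLI split via Hugging Face `datasets`
# (assumes the `datasets` package; not the paper's own pipeline).
from datasets import load_dataset

xnli_fr = load_dataset("xnli", "fr")
example = xnli_fr["test"][0]
# Each example is a premise/hypothesis pair with an entailment label (0/1/2).
print(example["premise"], example["hypothesis"], example["label"])
```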

Results from the Paper


Task | Dataset | Model | Metric | Value | Global Rank
Dependency Parsing | French GSD | CamemBERT | LAS | 92.47 | #1
Dependency Parsing | French GSD | CamemBERT | UAS | 94.82 | #1
Part-Of-Speech Tagging | French GSD | CamemBERT | UPOS | 98.19 | #1
Named Entity Recognition (NER) | French Treebank | CamemBERT (subword masking) | F1 | 87.93 | #1
Named Entity Recognition (NER) | French Treebank | CamemBERT (subword masking) | Precision | 88.35 | #1
Named Entity Recognition (NER) | French Treebank | CamemBERT (subword masking) | Recall | 87.46 | #1
Dependency Parsing | ParTUT | CamemBERT | LAS | 92.9 | #1
Dependency Parsing | ParTUT | CamemBERT | UAS | 95.21 | #1
Part-Of-Speech Tagging | ParTUT | CamemBERT | UPOS | 97.63 | #1
Part-Of-Speech Tagging | Sequoia Treebank | CamemBERT | UPOS | 99.21 | #1
Dependency Parsing | Sequoia Treebank | CamemBERT | LAS | 94.39 | #1
Dependency Parsing | Sequoia Treebank | CamemBERT | UAS | 95.56 | #1
Part-Of-Speech Tagging | Spoken Corpus | CamemBERT | UPOS | 96.68 | #1
Dependency Parsing | Spoken Corpus | CamemBERT | LAS | 81.37 | #1
Dependency Parsing | Spoken Corpus | CamemBERT | UAS | 86.05 | #1
Natural Language Inference | XNLI French | CamemBERT | Accuracy | 81.2 | #2
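The downstream results above come from fine-tuning the pretrained model on each task. As a hedged sketch of that setup for the XNLI row, the snippet below attaches a three-way sequence classification head to `camembert-base` with `transformers`; the head is freshly initialized here (it would need fine-tuning to reproduce the reported accuracy), and nothing about the paper's actual hyperparameters is implied.

```python
# Hedged sketch: CamemBERT set up for XNLI-style natural language inference
# (entailment / neutral / contradiction). The classification head is newly
# initialized, so outputs are meaningless until fine-tuned; this only shows
# the model wiring, not the paper's training configuration.
import torch
from transformers import CamembertTokenizer, CamembertForSequenceClassification

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForSequenceClassification.from_pretrained(
    "camembert-base", num_labels=3
)

premise = "Le chat dort sur le canapé."
hypothesis = "Un animal se repose."
# Premise/hypothesis pairs are encoded as a single sequence with separators.
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (near-uniform before fine-tuning)
```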

Methods


No methods listed for this paper.