RoBERT – A Romanian BERT Model

COLING 2020 · Mihai Masala, Stefan Ruseti, Mihai Dascalu

Deep pre-trained language models have become ubiquitous in the field of Natural Language Processing (NLP). These models learn contextualized representations from huge amounts of unlabeled text and obtain state-of-the-art results on a multitude of NLP tasks by enabling efficient transfer learning. For languages other than English, the options for such models are limited, and most are trained only on multi-lingual corpora. In this paper we introduce a Romanian-only pre-trained BERT model, RoBERT, and compare it with different multi-lingual models on seven Romanian-specific NLP tasks grouped into three categories: sentiment analysis, dialect and cross-dialect topic identification, and diacritics restoration. Our model surpasses the multi-lingual models, as well as another mono-lingual implementation of BERT, on all tasks.
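As a rough illustration of the transfer-learning setup described in the abstract, the sketch below loads a pre-trained Romanian BERT and attaches a sequence-classification head for one of the downstream tasks (sentiment analysis), using the Hugging Face Transformers API. The model identifier readerbench/RoBERT-base, the two-label setup, and the example sentence are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: fine-tuning-style inference with a mono-lingual Romanian BERT
# via Hugging Face Transformers. The model id below is an assumption, not a
# detail confirmed by the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "readerbench/RoBERT-base"  # assumed Hugging Face model id for RoBERT

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # assumed binary sentiment: negative / positive
)

# Tokenize a Romanian example sentence and run a forward pass.
inputs = tokenizer(
    "Filmul a fost excelent!",  # "The movie was excellent!"
    return_tensors="pt",
    truncation=True,
    padding=True,
)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(dim=-1).item()
print(predicted_label)
```

In practice the classification head would first be fine-tuned on a labeled Romanian sentiment corpus before its predictions are meaningful; the snippet only shows how the pre-trained encoder is plugged into a task-specific model.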
