Abusive language in Spanish children and young teenager's conversations: data preparation and short text classification with contextual word embeddings

LREC 2020 · Marta R. Costa-jussà, Esther González, Asuncion Moreno, Eudald Cumalat

Abusive texts are attracting growing attention from the scientific and social communities, and how to detect them automatically is a question of increasing interest in the natural language processing community. The main contribution of this paper is to evaluate the quality of the recently developed "Spanish Database for cyberbullying prevention" for the purpose of training classifiers to detect abusive short texts. We compare classical machine learning techniques with a more advanced model, contextual word embeddings, in the particular case of classifying abusive short texts in Spanish. As contextual word embeddings, we use Bidirectional Encoder Representations from Transformers (BERT), proposed at the end of 2018. We show that BERT mostly outperforms classical techniques. Beyond the experimental impact of our research, this project aims to plant the seeds for an innovative technological tool with high potential social impact, as part of the initiatives in artificial intelligence for social good.
