A Corpus for Multilingual Document Classification in Eight Languages

LREC 2018  ·  Holger Schwenk, Xi-An Li

Cross-lingual document classification aims at training a document classifier on resources in one language and transferring it to a different language without any additional resources. Several approaches have been proposed in the literature, and the current best practice is to evaluate them on a subset of the Reuters Corpus Volume 2. However, this subset covers only a few languages (English, German, French and Spanish), and almost all published works focus on the transfer between English and German. In addition, we have observed that the class prior distributions differ significantly between the languages. We argue that this complicates the evaluation of multilinguality. In this paper, we propose a new subset of the Reuters corpus with balanced class priors for eight languages. By adding Italian, Russian, Japanese and Chinese, we cover languages which are very different with respect to syntax, morphology, etc. We provide strong baselines for all language transfer directions using multilingual word and sentence embeddings, respectively. Our goal is to offer a freely available framework to evaluate cross-lingual document classification, and we hope by these means to foster research in this important area.
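The idea of a subset with balanced class priors can be illustrated with a minimal sketch. It is not the authors' extraction procedure: the function name `balanced_subset`, the `(label, text)` input format, and the toy labels are all assumptions made for illustration. It draws the same number of documents from every class, so each class has the same prior in the resulting subset.

```python
import random
from collections import defaultdict


def balanced_subset(docs, per_class, seed=0):
    """Draw `per_class` documents from every class so class priors are uniform.

    `docs` is a list of (label, text) pairs. This input format is a
    hypothetical choice for the sketch, not the format of the corpus.
    """
    # Group document texts by their class label.
    by_label = defaultdict(list)
    for label, text in docs:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"class {label!r} has only {len(pool)} documents")
        # Sample without replacement within each class.
        subset.extend((label, text) for text in rng.sample(pool, per_class))

    rng.shuffle(subset)  # avoid an ordering grouped by class
    return subset


# Toy, skewed input: ten documents of class "A", three of class "B".
toy_docs = [("A", f"a{i}") for i in range(10)] + [("B", f"b{i}") for i in range(3)]
toy_subset = balanced_subset(toy_docs, per_class=3)
```

Applying the same per-class count in every language would give identical class priors across languages, which is the property the abstract argues for.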


Datasets


Introduced in the Paper:

MLDoc
| Task | Dataset | Model | Metric Name | Metric Value (%) | Global Rank |
|---|---|---|---|---|---|
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Chinese | BiLSTM (UN) | Accuracy | 71.97 | #4 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Chinese | MultiCCA + CNN | Accuracy | 74.73 | #3 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-French | BiLSTM (Europarl) | Accuracy | 72.83 | #5 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-French | BiLSTM (UN) | Accuracy | 74.52 | #4 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-French | MultiCCA + CNN | Accuracy | 72.38 | #6 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-German | MultiCCA + CNN | Accuracy | 81.2 | #4 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-German | BiLSTM (Europarl) | Accuracy | 71.83 | #5 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Italian | MultiCCA + CNN | Accuracy | 69.38 | #3 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Italian | BiLSTM (Europarl) | Accuracy | 60.73 | #4 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Japanese | MultiCCA + CNN | Accuracy | 67.63 | #2 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Russian | BiLSTM (UN) | Accuracy | 61.42 | #4 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Russian | MultiCCA + CNN | Accuracy | 60.8 | #5 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Spanish | MultiCCA + CNN | Accuracy | 72.5 | #4 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Spanish | BiLSTM (UN) | Accuracy | 69.5 | #5 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Spanish | BiLSTM (Europarl) | Accuracy | 66.65 | #6 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot German-to-French | BiLSTM (Europarl) | Accuracy | 75.45 | #1 |
