We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 85 languages, including several dialects or low-resource languages.
Finally, we introduce a new test set of aligned sentences in 122 languages based on the Tatoeba corpus, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages.
CROSS-LINGUAL BITEXT MINING CROSS-LINGUAL DOCUMENT CLASSIFICATION CROSS-LINGUAL NATURAL LANGUAGE INFERENCE CROSS-LINGUAL TRANSFER DOCUMENT CLASSIFICATION JOINT MULTILINGUAL SENTENCE REPRESENTATIONS PARALLEL CORPUS MINING
Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora.
#2 best model for Cross-Lingual Bitext Mining on BUCC German-to-English
Although much effort has recently been devoted to training high-quality sentence embeddings, we still have a poor understanding of what they are capturing.
For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance.
SOTA for Text Classification on TREC-6
The analysis sheds light on the relative strengths of different sentence embedding methods with respect to these low level prediction tasks, and on the effect of the encoded vector's dimensionality on the resulting representations.
Despite the fast developmental pace of new sentence embedding methods, it is still challenging to find comprehensive evaluations of these different techniques.
The recent tremendous success of unsupervised word embeddings in a multitude of applications raises the obvious question if similar methods could be derived to improve embeddings (i. e. semantic representations) of word sequences as well.
We consider the problem of Recognizing Textual Entailment within an Information Retrieval context, where we must simultaneously determine the relevancy as well as degree of entailment for individual pieces of evidence to determine a yes/no answer to a binary natural language question.