BilBOWA: Fast Bilingual Distributed Representations without Word Alignments

9 Oct 2014  ·  Stephan Gouws, Yoshua Bengio, Greg Corrado ·

We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which scales to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data.
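The cross-lingual objective described above can be sketched as minimizing the squared distance between the mean bag-of-words embeddings of an aligned sentence pair. The snippet below is a minimal illustration of that idea, not the authors' implementation; all names (`emb_src`, `sent_src`, etc.) and the plain gradient-step usage are illustrative assumptions.

```python
import numpy as np

def bilbowa_xling_loss(emb_src, emb_tgt, sent_src, sent_tgt):
    """Squared L2 distance between the mean word embeddings of an
    aligned sentence pair (illustrative sketch of the BilBOWA
    cross-lingual regularizer, not the paper's code)."""
    mean_src = emb_src[sent_src].mean(axis=0)  # (d,)
    mean_tgt = emb_tgt[sent_tgt].mean(axis=0)  # (d,)
    diff = mean_src - mean_tgt
    return float(diff @ diff)

def xling_grad_src(emb_src, emb_tgt, sent_src, sent_tgt):
    """Gradient of the loss w.r.t. each source word appearing in the
    source sentence: every one of the m source words receives the
    same (2/m) * diff update."""
    diff = emb_src[sent_src].mean(axis=0) - emb_tgt[sent_tgt].mean(axis=0)
    return 2.0 * diff / len(sent_src)
```

In the full model this term is added, for sampled sentence pairs, to the two monolingual noise-contrastive losses, pulling the embedding spaces of the two languages together without any word-level alignment.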


Results

Task                    | Dataset       | Model   | Metric   | Value | Rank
------------------------|---------------|---------|----------|-------|-----
Document Classification | Reuters De-En | BilBOWA | Accuracy | 75    | #1
Document Classification | Reuters En-De | BilBOWA | Accuracy | 86.5  | #1
