Parallel Corpus Mining
8 papers with code • 0 benchmarks • 1 dataset
Mining a corpus of bilingual sentence pairs that are translations of each other.
We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts.
Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora.
We examine a language-independent framework for parallel corpus mining, a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English.
We show that our fully unsupervised metrics are effective, i.e., they beat supervised competitors on 4 out of our 5 evaluation datasets.
Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts
To create the parallel corpora, we propose a dynamic programming based sentence alignment algorithm which leverages the cosine similarity of machine-translated sentences.
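The alignment idea in the last snippet can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes the source sentences have already been machine-translated into the target language, uses a plain bag-of-words cosine similarity, and allows only one-to-one matches or skips in document order. The names `cosine`, `align`, and the `threshold` value are hypothetical, chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Bag-of-words cosine similarity between two whitespace-tokenized sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def align(src_translated, tgt, threshold=0.3):
    """Monotonic one-to-one alignment that maximizes total cosine similarity.

    src_translated: source sentences already machine-translated (assumption).
    tgt: target-language sentences.
    threshold: hypothetical cutoff below which a pair is never matched.
    Returns a list of (source_index, target_index, similarity) tuples.
    """
    n, m = len(src_translated), len(tgt)
    sim = [[cosine(s, t) for t in tgt] for s in src_translated]

    # score[i][j] = best total similarity using the first i source
    # and the first j target sentences.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim[i - 1][j - 1]
            match = score[i - 1][j - 1] + (s if s >= threshold else 0.0)
            score[i][j] = max(match, score[i - 1][j], score[i][j - 1])

    # Trace back through the table to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim[i - 1][j - 1]
        if s >= threshold and score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1, s))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1  # skip a source sentence
        else:
            j -= 1  # skip a target sentence
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    translated = [
        "the cat sat on the mat",
        "hello everyone",
        "we study machine translation",
    ]
    target = [
        "the cat sat on a mat",
        "unrelated filler sentence",
        "we study neural machine translation",
    ]
    for si, ti, s in align(translated, target):
        print(si, ti, round(s, 3))
```

In practice the bag-of-words vectors would likely be replaced by a stronger similarity signal, and the recurrence extended to one-to-many matches. The sketch only shows how dynamic programming enforces a monotonic alignment while skipping sentences that have no counterpart.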