Parallel Corpus Mining
8 papers with code • 0 benchmarks • 1 dataset
Mining a corpus of bilingual sentence pairs that are translations of each other.
Most implemented papers
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts.
Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora.
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software.
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
To address the lack of publicly available parallel corpora for lecture translation, we examine a language-independent framework for parallel corpus mining, which is a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases
Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English.
Parallel Sentence Mining by Constrained Decoding
We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation.
USCORE: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation
We show that our fully unsupervised metrics are effective, i.e., they beat supervised competitors on 4 out of our 5 evaluation datasets.
Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts
To create the parallel corpora, we propose a dynamic-programming-based sentence alignment algorithm that leverages the cosine similarity of machine-translated sentences.
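
The idea in the last entry, dynamic-programming alignment driven by cosine similarity, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `min_sim` threshold, the bag-of-words vectors, and the restriction to monotonic one-to-one alignments are all assumptions made for illustration. The source sentences are assumed to have been machine-translated into the target language already.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align_sentences(src_translated, tgt, min_sim=0.3):
    """Monotonic one-to-one alignment that maximizes total cosine similarity.

    src_translated: source sentences, already machine-translated into the
                    target language (hypothetical input for this sketch).
    tgt:            target-language sentences.
    min_sim:        illustrative threshold; weaker pairs are never aligned.
    Returns a list of (source_index, target_index, similarity) tuples.
    """
    a = [Counter(s.lower().split()) for s in src_translated]
    b = [Counter(s.lower().split()) for s in tgt]
    n, m = len(a), len(b)
    sim = [[cosine(x, y) for y in b] for x in a]

    # best[i][j]: best total score using the first i source and
    # first j target sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim[i - 1][j - 1]
            match = best[i - 1][j - 1] + (s if s >= min_sim else 0.0)
            best[i][j] = max(match, best[i - 1][j], best[i][j - 1])

    # Trace back through the table to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim[i - 1][j - 1]
        if s >= min_sim and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1, s))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1  # skip a source sentence
        else:
            j -= 1  # skip a target sentence
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    translated = [
        "the cat sat on the mat",
        "welcome to the course",
        "see you next week",
    ]
    target = [
        "the cat sat on a mat",
        "this line has no counterpart",
        "see you next week",
    ]
    for i, j, s in align_sentences(translated, target):
        print(f"{i} <-> {j}  similarity={s:.2f}")
```

A real system would likely replace the word-count vectors with stronger sentence representations and allow one-to-many alignments; the dynamic-programming table and traceback would keep the same shape.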