Natural Language Processing

Parallel Corpus Mining

8 papers with code • 0 benchmarks • 1 dataset

Parallel corpus mining is the task of extracting a corpus of bilingual sentence pairs that are translations of each other.

Benchmarks

These leaderboards are used to track progress in Parallel Corpus Mining.

No evaluation results yet.

Libraries

Use these libraries to find Parallel Corpus Mining models and implementations.

facebookresearch/LASER • 2 papers • 3,520 stars

Datasets

ASLG-PC12

Latest papers

Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

shyyhs/CourseraParallelCorpusMining • 7 Nov 2023

To create the parallel corpora, we propose a dynamic programming based sentence alignment algorithm which leverages the cosine similarity of machine-translated sentences.

07 Nov 2023
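
The summary above mentions a dynamic programming alignment driven by cosine similarity of machine-translated sentences. The following is a minimal sketch of that general idea, not the paper's algorithm: it assumes the source sentences have already been machine-translated into the target language, uses bag-of-words cosine similarity as a stand-in for whatever sentence representation the authors use, and allows only monotonic one-to-one matches with skips. The names `cosine`, `align`, and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors (an illustrative stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def align(translated_src, tgt, threshold=0.5):
    """Monotonic one-to-one alignment that maximizes total similarity.

    Sentences on either side may be skipped; pairs scoring below
    `threshold` (an assumed cutoff) are never matched.
    """
    n, m = len(translated_src), len(tgt)
    sim = [[cosine(s, t) for t in tgt] for s in translated_src]

    # best[i][j]: best total score using the first i source and first j target sentences
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim[i - 1][j - 1]
            match = best[i - 1][j - 1] + (s if s >= threshold else 0.0)
            best[i][j] = max(match, best[i - 1][j], best[i][j - 1])

    # Backtrack from the bottom-right corner to recover the matched pairs
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = sim[i - 1][j - 1]
        if s >= threshold and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Called with a list of translated source sentences and a list of target sentences, `align` returns index pairs `(source_index, target_index)` in document order. Enforcing monotonicity is a design assumption that suits transcripts, where sentence order is usually preserved across languages.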

USCORE: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation

potamides/unsupervised-metrics • 21 Feb 2022

We show that our fully unsupervised metrics are effective, i.e., they beat supervised competitors on 4 out of our 5 evaluation datasets.

21 Feb 2022

ParaCrawl: Web-Scale Acquisition of Parallel Corpora

marian-nmt/marian • ACL 2020

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software.

1,170 stars

01 Jul 2020

Parallel Sentence Mining by Constrained Decoding

marian-nmt/marian-dev • ACL 2020

We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation.

247 stars

01 Jul 2020

MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases

facebookresearch/muss • LREC 2022

Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English.

01 May 2020

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

shyyhs/CourseraParallelCorpusMining • LREC 2020

To address this, we examine a language independent framework for parallel corpus mining which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera.

26 Dec 2019

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

facebookresearch/LASER • TACL 2019

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts.

3,520 stars

26 Dec 2018

Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

facebookresearch/LASER • ACL 2019

Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora.

3,520 stars

03 Nov 2018
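
The title of this entry refers to margin-based scoring over multilingual sentence embeddings. As a rough illustration, the sketch below implements one common formulation, the ratio margin: the cosine similarity of a candidate pair divided by the average similarity of each sentence to its k nearest neighbours in the other language. The toy vectors, the function names `margin_scores` and `mine`, and the values of `k` and `threshold` are assumptions for illustration; in practice the embeddings would come from a multilingual encoder such as LASER and the neighbour search would use an approximate index.

```python
import math


def cos(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def margin_scores(src, tgt, k=2):
    """Ratio-margin score for every (source, target) embedding pair."""
    sim = [[cos(x, y) for y in tgt] for x in src]
    # Mean similarity of each sentence to its k nearest neighbours on the other side
    src_nn = [sum(sorted(row, reverse=True)[:k]) / k for row in sim]
    tgt_nn = [sum(sorted(col, reverse=True)[:k]) / k for col in zip(*sim)]
    scores = []
    for i in range(len(src)):
        row = []
        for j in range(len(tgt)):
            denom = (src_nn[i] + tgt_nn[j]) / 2
            row.append(sim[i][j] / denom if denom else 0.0)
        scores.append(row)
    return scores


def mine(src, tgt, k=2, threshold=1.0):
    """Keep each source sentence's best target if its margin exceeds `threshold`."""
    pairs = []
    for i, row in enumerate(margin_scores(src, tgt, k)):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] > threshold:
            pairs.append((i, j))
    return pairs
```

Normalizing by neighbourhood similarity penalizes "hub" sentences that are moderately similar to everything, which a fixed cosine threshold cannot do. The forward-only search in `mine` is a simplification; a fuller pipeline might also search from target to source and merge the candidates.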
