4 dataset results for Text Generation AND Texts AND French

MTNT (Machine Translation of Noisy Text)

The Machine Translation of Noisy Text (MTNT) dataset is a machine translation dataset that consists of noisy comments from Reddit and professionally sourced translations. The translations are between French, Japanese and English, with between 7k and 37k sentences per language pair.

51 PAPERS • NO BENCHMARKS YET

CONAN (COunter NArratives through Nichesourcing)

COunter NArratives through Nichesourcing (CONAN) is a dataset that consists of 4,078 hate speech/counter-narrative pairs across three languages (English, French and Italian). Additionally, three types of metadata are provided: expert demographics, hate speech sub-topic and counter-narrative type. The dataset is augmented through translation (from Italian/French to English) and paraphrasing, which brings the total number of pairs to 14,988.

21 PAPERS • NO BENCHMARKS YET

Opusparcus

Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.

15 PAPERS • NO BENCHMARKS YET

CLSE (Corpus of Linguistically Significant Entities)

CLSE is an augmented version of the Schema-Guided Dialog Dataset. The corpus includes 34 languages and covers 74 different semantic types to support various applications from airline ticketing to video games.

2 PAPERS • NO BENCHMARKS YET
