FFR Dataset is an ongoing project to collect, clean, and store corpora of Fon and French sentences for Fon-French machine translation. Fon (also called Fongbe) is an indigenous African language spoken mostly in Benin, by about 1.7 million people. Because training data is crucial to the performance of a machine learning model, the project aims to compile the largest set of training corpora for the research and design of translation and NLP models involving Fon. The dataset currently contains 117,029 parallel Fon-French sentences.
1 PAPER • NO BENCHMARKS YET
Fongbe Data collected by Fréjus A. A LALEYE
1 PAPER • 1 BENCHMARK
PolyNews is a multilingual dataset containing news titles in 77 languages and 19 scripts.
PolyNews is a multilingual parallel dataset containing news titles in 833 language pairs, spanning 64 languages and 17 scripts.
This dataset was created for the Fongbe automatic speech recognition task and contains about 3979 recordings of 13 participants reading a text written in Fongbe, one sentence at a time. Fongbe is a vernacular language spoken mainly in Benin, by more than 50% of the population, and to a lesser extent in Togo and Nigeria. It is an under-resourced language because it lacks linguistic resources (speech corpora and text data) and very few websites provide textual data. In this dataset, each example contains the audio file and the associated text. The audio is high-quality (16-bit, 16 kHz), recorded using an Android app that we built for this purpose. The dataset is multi-speaker, containing recordings from 13 volunteers (male and female).
0 PAPER • NO BENCHMARKS YET