VOXLINGUA107: A DATASET FOR SPOKEN LANGUAGE RECOGNITION

25 Nov 2020 · Jörgen Valk, Tanel Alumäe

This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages. Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech. Post-filtering is used to remove segments from the database that are likely not in the given language, increasing the proportion of correctly labeled segments to 98%, based on crowd-sourced verification. The resulting training set (VoxLingua107) contains 6628 hours of speech (62 hours per language on average) and is accompanied by an evaluation set of 1609 verified utterances. We use the data to build language recognition models for several spoken language identification tasks. Experiments show that training on the automatically retrieved data gives results competitive with those obtained using hand-labeled proprietary datasets. The dataset is publicly available.
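
The abstract does not say how the semi-random search phrases are sampled, so the following is only a minimal sketch of one plausible approach: drawing short runs of consecutive words from a piece of language-specific text. The function name `make_search_phrases`, its parameters, and the Estonian sample text are all hypothetical and are not taken from the paper.

```python
import random
import re


def make_search_phrases(text, n_phrases=5, words_per_phrase=3, seed=0):
    """Sample short runs of consecutive words from `text` to use as search queries.

    Hypothetical helper: the paper's actual sampling procedure may differ.
    """
    # Crude tokenization: runs of word characters (Unicode-aware in Python 3).
    # Punctuation is dropped, so a phrase may span a sentence boundary.
    words = re.findall(r"\w+", text)
    if len(words) < words_per_phrase:
        raise ValueError("text is too short for the requested phrase length")

    rng = random.Random(seed)  # seeded so that runs are reproducible
    phrases = []
    for _ in range(n_phrases):
        # Pick a random window of `words_per_phrase` consecutive words.
        start = rng.randrange(len(words) - words_per_phrase + 1)
        phrases.append(" ".join(words[start:start + words_per_phrase]))
    return phrases


# Placeholder text standing in for a language-specific Wikipedia extract.
sample_text = (
    "Tallinn on Eesti pealinn ja suurim linn. "
    "Linn asub Soome lahe lõunarannikul."
)
phrases = make_search_phrases(sample_text, n_phrases=4, words_per_phrase=3)
print(phrases)
```

Each returned phrase would then serve as one search query for the given language. Sampling from text written in the target language biases the queries, and hence the retrieved videos, toward that language, which is presumably why post-filtering is still needed to remove mislabeled segments.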

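Task: Spoken language identification

| Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|
| KALAKA-3 | Model on the noisy data | PC | 0.055 | 2 |
| KALAKA-3 | Model on the noisy data | PO | 0.083 | 2 |
| KALAKA-3 | Model on the noisy data | EC | 0.033 | 2 |
| KALAKA-3 | Model on the noisy data | EO | 0.059 | 2 |
| KALAKA-3 | Model on the automatically filtered (cleaned) data | PC | 0.041 | 1 |
| KALAKA-3 | Model on the automatically filtered (cleaned) data | PO | 0.056 | 1 |
| KALAKA-3 | Model on the automatically filtered (cleaned) data | EC | 0.022 | 1 |
| KALAKA-3 | Model on the automatically filtered (cleaned) data | EO | 0.058 | 1 |
| LRE07 | Kaldi i-vector | 3 sec | 26.04 | 9 |
| LRE07 | Kaldi i-vector | 10 sec | 11.93 | 9 |
| LRE07 | Kaldi i-vector | 30 sec | 4.52 | 9 |
| LRE07 | Kaldi i-vector | Average | 14.17 | 9 |
| LRE07 | Kaldi i-vector DNN | 3 sec | 19.67 | 8 |
| LRE07 | Kaldi i-vector DNN | 10 sec | 7.84 | 8 |
| LRE07 | Kaldi i-vector DNN | 30 sec | 3.31 | 8 |
| LRE07 | Kaldi i-vector DNN | Average | 10.27 | 8 |
| LRE07 | GMM-MMI | 3 sec | 17.28 | 6 |
| LRE07 | GMM-MMI | 10 sec | 5.90 | 6 |
| LRE07 | GMM-MMI | 30 sec | 2.10 | 7 |
| LRE07 | GMM-MMI | Average | 8.42 | 6 |
| LRE07 | CNN-SAP | 3 sec | 8.59 | 2 |
| LRE07 | CNN-SAP | 10 sec | 2.49 | 1 |
| LRE07 | CNN-SAP | 30 sec | 1.09 | 1 |
| LRE07 | CNN-SAP | Average | 4.06 | 2 |
| LRE07 | CNN-LDE | 3 sec | 8.25 | 1 |
| LRE07 | CNN-LDE | 10 sec | 2.61 | 2 |
| LRE07 | CNN-LDE | 30 sec | 1.16 | 2 |
| LRE07 | CNN-LDE | Average | 4.00 | 1 |
| LRE07 | Resnet34 (cleaned data) | 3 sec | 9.39 | 3 |
| LRE07 | Resnet34 (cleaned data) | 10 sec | 3.14 | 3 |
| LRE07 | Resnet34 (cleaned data) | 30 sec | 1.90 | 6 |
| LRE07 | Resnet34 (cleaned data) | Average | 4.81 | 3 |
| LRE07 | Resnet34 (noisy data) | 3 sec | 10.58 | 4 |
| LRE07 | Resnet34 (noisy data) | 10 sec | 3.33 | 4 |
| LRE07 | Resnet34 (noisy data) | 30 sec | 1.72 | 5 |
| LRE07 | Resnet34 (noisy data) | Average | 5.21 | 4 |
| LRE07 | Fusion of models | 3 sec | 15.29 | 5 |
| LRE07 | Fusion of models | 10 sec | 4.54 | 5 |
| LRE07 | Fusion of models | 30 sec | 1.30 | 3 |
| LRE07 | Fusion of models | Average | 7.04 | 5 |
| LRE07 | Phonotactic | 3 sec | 18.59 | 7 |
| LRE07 | Phonotactic | 10 sec | 6.28 | 7 |
| LRE07 | Phonotactic | 30 sec | 1.34 | 4 |
| LRE07 | Phonotactic | Average | 8.73 | 7 |
| VOXLINGUA107 | Cleaned | 0-5 sec | 13.4 | 2 |
| VOXLINGUA107 | Cleaned | 5-20 sec | 6.6 | 2 |
| VOXLINGUA107 | Cleaned | Average | 7.6 | 2 |
| VOXLINGUA107 | Noisy | 0-5 sec | 12.3 | 1 |
| VOXLINGUA107 | Noisy | 5-20 sec | 6.1 | 1 |
| VOXLINGUA107 | Noisy | Average | 7.1 | 1 |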
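
The LRE07 "Average" values listed above appear to be the unweighted mean of the 3 sec, 10 sec, and 30 sec results. The short check below, with numbers copied from those rows, recomputes the means and reports the largest disagreement. The dictionary name `lre07` and the variable names are illustrative only.

```python
# LRE07 rows from the results above: (3 sec, 10 sec, 30 sec, listed Average).
lre07 = {
    "Kaldi i-vector": (26.04, 11.93, 4.52, 14.17),
    "Kaldi i-vector DNN": (19.67, 7.84, 3.31, 10.27),
    "GMM-MMI": (17.28, 5.90, 2.10, 8.42),
    "CNN-SAP": (8.59, 2.49, 1.09, 4.06),
    "CNN-LDE": (8.25, 2.61, 1.16, 4.00),
    "Resnet34 (cleaned data)": (9.39, 3.14, 1.90, 4.81),
    "Resnet34 (noisy data)": (10.58, 3.33, 1.72, 5.21),
    "Fusion of models": (15.29, 4.54, 1.30, 7.04),
    "Phonotactic": (18.59, 6.28, 1.34, 8.73),
}

# Recompute the unweighted mean of the three duration conditions per model.
for model, row in lre07.items():
    mean = sum(row[:3]) / 3
    print(f"{model:26s} mean={mean:.3f} listed={row[3]:.2f}")

# Largest gap between the recomputed mean and the listed average.
max_gap = max(abs(sum(row[:3]) / 3 - row[3]) for row in lre07.values())

# Model with the lowest listed average (lower appears to be better here,
# since it carries global rank 1).
best = min(lre07, key=lambda model: lre07[model][3])
print(f"largest gap: {max_gap:.4f}, lowest listed average: {best}")
```

Every listed average agrees with the recomputed mean to within 0.01, so the small differences look like rounding. The same relation does not hold for the VOXLINGUA107 rows: the simple mean of 13.4 and 6.6 is 10.0, not the listed 7.6, so those averages are presumably weighted, for example by the number of utterances in each duration bin.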
