CoVoST is a large-scale multilingual speech-to-text translation corpus. Its second and latest version covers translations from 21 languages into English and from English into 15 languages. It contains a total of 2,880 hours of speech from 78K speakers with 66 accents.
32 PAPERS • NO BENCHMARKS YET
CVSS is a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems.
18 PAPERS • NO BENCHMARKS YET
The DISRPT 2021 shared task, co-located with CODI 2021 at EMNLP, introduces the second iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the first iteration of a cross-formalism discourse relation classification task.
3 PAPERS • NO BENCHMARKS YET
Data were collected by asking adults recruited through social media and students from an elementary school to participate in our experiment. Table 1 shows the number of samples gathered for recognizing each color. Because Persian has two words for black, there are more samples for black than for the other colors. In addition, because color recognition is a RAN task, sequences of data were also gathered; Table 2 shows the number of sequence samples for the colors. For the meaningless words, an average of 12 voice recordings was gathered per word (there are 40 meaningless words in this task).
1 PAPER • NO BENCHMARKS YET
A modified version of the ShEMO dataset, produced with the help of an Automatic Speech Recognition (ASR) system.
The Persian Consonant Vowel Combination (PCVC) dataset is a phoneme-based speech dataset and the first free Persian speech dataset, intended to help Persian speech researchers. The dataset covers 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of consonants and vowels (138 samples per speaker), each 30,000 data samples long. All speech samples are recorded at a sample rate of 48,000 Hz, i.e., 48,000 data samples per second. In each sample, the sound starts with the consonant, continues with the vowel, and ends with silence. The length of the silence depends on the length of the consonant-vowel combination. For example, if the combination ends at the 20,000th data sample, the remaining 10,000 data samples (up to 30,000, the length of each sound sample) are silence.
0 PAPERS • NO BENCHMARKS YET
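The PCVC figures above can be checked with a little arithmetic. The following is a minimal sketch rather than code shipped with the dataset: all function and constant names are hypothetical, and it assumes that the trailing silence can be represented as zero-valued samples, which the description does not state.

```python
# Constants taken from the PCVC description.
SAMPLE_RATE_HZ = 48_000   # data samples per second
CLIP_LENGTH = 30_000      # data samples per sound sample
NUM_CONSONANTS = 23
NUM_VOWELS = 6


def combinations_per_speaker(consonants=NUM_CONSONANTS, vowels=NUM_VOWELS):
    """Number of consonant-vowel combinations recorded per speaker."""
    return consonants * vowels


def clip_duration_seconds(length=CLIP_LENGTH, rate=SAMPLE_RATE_HZ):
    """Duration of one sound sample in seconds."""
    return length / rate


def trailing_silence(speech_end, length=CLIP_LENGTH):
    """Number of silent data samples left after the speech ends at speech_end."""
    if not 0 <= speech_end <= length:
        raise ValueError("speech_end must lie within the clip")
    return length - speech_end


def pad_to_clip_length(samples, length=CLIP_LENGTH):
    """Pad a consonant-vowel recording to the fixed clip length.

    Assumption: silence is modeled as 0.0; real recordings may contain
    low-level background noise instead.
    """
    if len(samples) > length:
        raise ValueError("recording is longer than the clip length")
    return list(samples) + [0.0] * (length - len(samples))


if __name__ == "__main__":
    print(combinations_per_speaker())   # 23 * 6
    print(clip_duration_seconds())      # 30000 / 48000
    print(trailing_silence(20_000))     # the example from the description
```

Under these numbers, each fixed-length sound sample lasts well under one second, so a speaker's full set of 138 combinations amounts to roughly a minute and a half of audio.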