VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
8 PAPERS • 7 BENCHMARKS
Description: More than 2,000 native Chinese speakers participated in the recording, with a balanced gender distribution. Speakers are mainly from southern China, with some from the provinces of northern China with strong accents. The recording content is rich, covering mobile phone voice assistant interaction, smart home command and control, in-car command and control, numbers and other fields, closely matching practical application scenarios such as smart home and intelligent vehicles.
1 PAPER • NO BENCHMARKS YET
Description: 2,000 Changsha natives participated in the recording, covering multiple age groups, with a balanced gender distribution and authentic accents. The recorded text is rich in content, covering general, interactive, in-car, home and other categories. Changsha locals checked and proofread the transcriptions, and sentence accuracy reaches 95%. The dataset is mainly used for speech recognition, machine translation and voiceprint recognition.
0 PAPERS • NO BENCHMARKS YET
Description: Indian English audio data captured on mobile phones, 1,012 hours in total, recorded by 2,100 native Indian speakers. The recorded text was designed by linguistic experts, covering generic, interactive, on-board, home and other categories. The text has been proofread manually with high accuracy; this data set can be used for automatic speech recognition, machine translation, and voiceprint recognition.
Description: This dataset collects speech from 463 Henan locals with authentic accents. The recording contents cover daily messages and customer consultations across multiple fields. Transcriptions are checked and proofread by Henan locals to ensure high accuracy. The audio was recorded on Android phones and iPhones.
Description: This dataset is recorded by 402 native Australian speakers with a balanced gender distribution. It is rich in content, covering the generic command and control, human-machine interaction, smart home command and control, and in-car command and control categories. The transcription corpus has been manually proofread to ensure high accuracy.
Description: This product contains speech data recorded by 400 native Korean speakers, with a roughly equal gender distribution. The corpus covers a wide range of domains with rich content, including the generic, human-machine interaction, in-car and smart home categories. The corpus text was manually checked to ensure high accuracy.
Description: This dataset is recorded by 452 native Singaporean speakers with a balanced gender distribution. It is rich in content, covering the generic command and control, human-machine interaction, smart home command and control, and in-car command and control categories. The transcription corpus has been manually proofread to ensure high accuracy.
Description: The data was collected from 203 Taiwanese speakers (137 female, 66 male) from Taipei, Kaohsiung, Taichung, Tainan and other cities. It was recorded in a quiet indoor environment and can be used for speech recognition, machine translation and voiceprint recognition model training, as well as algorithm research.
Description: 532 Portuguese speakers recorded English speech with authentic accents in a relatively quiet environment. The recording script was designed by linguists and covers a wide range of topics including generic, interactive, on-board and home. The text is manually proofread with high accuracy. The recordings were captured on mainstream Android and Apple phones.
Description: 497 Italian speakers recorded English speech with authentic accents in a relatively quiet environment. The recording script was designed by linguists and covers a wide range of topics including generic, interactive, on-board and home. The text is manually proofread with high accuracy. The recordings were captured on mainstream Android and Apple phones.
Description: This dataset is recorded by 498 native Russian speakers with a balanced gender distribution. It is rich in content, covering the generic command and control, human-machine interaction, smart home command and control, and in-car command and control categories. The transcription corpus has been manually proofread to ensure high accuracy.
Description: This data set contains speech data from 349 English speakers, all of whom are English locals. The recording environment is quiet. The recorded content covers many fields such as in-car, home and voice assistant, with about 50 sentences per person and 9.5 hours of valid data. All texts are manually transcribed with high accuracy.
Description: 891 native Spanish speakers participated in the recording with authentic accents. The recorded script was designed by linguists and covers a wide range of topics including generic, interactive, on-board and home. The text is manually proofread with high accuracy. The recordings were captured on mainstream Android and Apple phones. The data set can be applied to automatic speech recognition and machine translation.
Description: 500 Hours - Indian English Colloquial Video Speech Data, collected from real-world websites and covering multiple fields. Various attributes such as text content and speaker identity are annotated. This data set can be used for voiceprint recognition model training, construction of corpora for machine translation, and algorithm research.
Description: 505 Hours - Uyghur Colloquial Video Speech Data, collected from real-world websites and covering multiple fields. Various attributes such as text content and speaker identity are annotated. This data set can be used for voiceprint recognition model training, construction of corpora for machine translation, and algorithm research.
Description: 1,089 native French speakers participated in the recording with authentic accents. The recorded script was designed by linguists and covers a wide range of topics including generic, interactive, on-board and home. The text is manually proofread with high accuracy. The recordings were captured on mainstream Android and Apple phones. The data set can be applied to automatic speech recognition and machine translation.
Description: This dataset collects speech from 2,034 local Chinese speakers across 26 provinces, including Henan, Shanxi, Sichuan, Hunan and Fujian. It is Mandarin speech data with heavy accents. The recording contents cover finance and economics, entertainment, policy, news, TV and movies.
Description: This dataset collects 312 speakers from the Northeast China region, all reading texts in the Northeastern dialect. The recording contents cover customer consultation and text messages from nearly 30 fields. Sentences are manually transcribed and proofread by professional annotators, with high accuracy.
Description: This dataset collects 2,507 speakers from the Sichuan Basin, recorded in a quiet indoor environment. The recorded content covers customer consultation and text messages in many fields. The average number of repetitions is 1.3 and the average sentence length is 12.5 words. Sichuan natives participated in quality inspection and proofreading to ensure the accuracy of the transcriptions.
Description: 1,842 native speakers of American English participated in the recording with authentic accents. The recorded script was designed by linguists based on usage scenarios and covers a wide range of topics including generic, interactive, on-board and home. The text is manually proofread with high accuracy. The recordings were captured on mainstream Android and Apple phones.
Description: 1,730 Sichuan native speakers participated in the recording, holding face-to-face free conversations in a natural way across a wide range of fields, with no topics specified. The speech is natural and fluent, in line with real dialogue scenarios. The speech was transcribed into text manually to ensure high accuracy.
Description: Mobile phone-captured audio data of the Wuhan dialect, 997 hours in total, recorded by more than 2,000 native Wuhan dialect speakers. The recorded text covers generic, interactive, on-board, home and other categories, with rich content. Wuhan locals participated in quality checking and proofreading, and the sentence accuracy rate reaches 95%. This data set can be used for automatic speech recognition, machine translation, and voiceprint recognition.
Description: 995 local Cantonese speakers participated in the recording, conducting face-to-face communication in a natural way. They held free discussions on a number of given topics across a wide range of fields; the speech is natural and fluent, in line with real dialogue scenarios. The text was transcribed manually, with high accuracy.
Description: Interspeech 2020 Accented English Speech Recognition Competition Data. The text has been proofread manually with high accuracy; this data set can be used for automatic speech recognition, machine translation, and voiceprint recognition.
Description: The data set contains speech data from 302 North American speakers. The recording contents include phrases and sentences covering rich scenarios, with 201 hours of valid data. The recording environment is a quiet indoor setting, and the recording devices include PCs, Android phones, and iPhones. This data can be used for speech recognition research in the North American region.