VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.
458 PAPERS • 9 BENCHMARKS
Consists of more than 210k videos for 310 audio classes.
61 PAPERS • NO BENCHMARKS YET
CN-Celeb is a large-scale speaker recognition dataset collected 'in the wild'. It contains more than 130,000 utterances from 1,000 Chinese celebrities and covers 11 different real-world genres.
39 PAPERS • 1 BENCHMARK
Description: More than 2,000 Chinese native speakers participated in the recording, with a balanced gender distribution. Speakers are mainly from southern China, with some from northern provinces who have strong accents. The recording content is rich, covering mobile phone voice assistant interaction, smart home command and control, in-car command and control, numbers, and other fields, accurately matching practical application scenarios such as smart home and intelligent vehicles.
1 PAPER • NO BENCHMARKS YET
Description: The dataset contains 200 Chinese native speakers, covering the main dialect zones. It is recorded in both noisy and quiet environments, making it better suited to real-world speech recognition applications. The recordings are commonly used spoken sentences. Texts are transcribed by professional annotators. It can be used for speech recognition and machine translation.
The football keyword dataset (FKD) is a new keyword spotting dataset in Persian, collected via crowdsourcing. It contains nearly 31,000 samples in 18 classes.
1 PAPER • 2 BENCHMARKS
MAVS is an audio-visual smartphone dataset captured on five different recent smartphones. It contains 103 subjects recorded in three sessions under different real-world scenarios. Three languages are included to address the language dependency of speaker recognition systems.
Description: 2,284 native speakers of the Kunming dialect participated in the recording, with authentic accents and from multiple age groups. The recorded script covers a wide range of topics such as generic, interactive, on-board, and home. Local people in Kunming participated in quality checking and proofreading, and the text was transcribed accurately. Recording devices are mainstream Android and Apple phones.
0 PAPERS • NO BENCHMARKS YET
Description: Indian English audio data captured by mobile phones, 1,012 hours in total, recorded by 2,100 Indian native speakers. The recorded text is designed by linguistic experts, covering generic, interactive, on-board, home and other categories. The text has been proofread manually with high accuracy; this data set can be used for automatic speech recognition, machine translation, and voiceprint recognition.
Description: More than 1,000 speakers read the specified wake-up words at three speeds: slow, normal, and fast. Audio was recorded in a professional recording studio using a microphone.
The data were recorded by 700 Mandarin speakers, 65% of whom were women. There was no pre-made text; speakers made phone calls in a natural way while the calls were recorded. Annotation mainly covers the near-end speech, and the speech content is naturally colloquial.
Description: German audio data captured by mobile phone, 1,796 hours in total, recorded by 3,442 German native speakers. The recorded text is designed by linguistic experts, covering generic, interactive, on-board, home and other categories. The text has been proofread manually with high accuracy; this data can be used for automatic speech recognition, machine translation, and voiceprint recognition.
Format: 16kHz, 16bit, uncompressed wav, mono channel
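A format line like the one above (16 kHz sample rate, 16-bit samples, uncompressed WAV, mono) can be verified programmatically with Python's standard-library `wave` module before feeding files into a training pipeline. This is a minimal sketch; the helper name `check_format` is illustrative, not part of any dataset's tooling:

```python
import wave

def check_format(path: str) -> bool:
    """Return True if the WAV file matches the stated format:
    16 kHz sample rate, 16-bit samples, mono, uncompressed PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000      # 16 kHz
            and wav.getsampwidth() == 2      # 16 bit = 2 bytes per sample
            and wav.getnchannels() == 1      # mono channel
            and wav.getcomptype() == "NONE"  # uncompressed PCM
        )
```

Running such a check over a delivered corpus is a cheap way to catch files that were resampled, stereo-mixed, or compressed in transit.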
Description: This dataset is recorded by 402 native Australian speakers with a balanced gender distribution. It is rich in content, covering generic command and control, human-machine interaction, smart home command and control, and in-car command and control categories. The transcription corpus has been manually proofread to ensure high accuracy.
Description: 4,787 Chinese native speakers participated in the recording, with a balanced gender distribution. Speakers come from various provinces of China. The recording content is rich, covering mobile phone voice assistant interaction, smart home command and control, in-car command and control, numbers, and other fields, accurately matching practical application scenarios such as smart home and intelligent vehicles.
Description: English emotional audio data captured by microphone. 20 American native speakers participated in the recording, with 2,100 sentences per person. The recorded script covers 10 emotions such as anger, happiness, and sadness. The voice was recorded with a high-fidelity microphone and is therefore of high quality. It is used for analytical detection of emotional speech.
Description: The product contains speech data recorded by 400 native Korean speakers, with a roughly equal gender distribution. The corpus covers a wide domain with rich content across generic, human-machine interaction, in-car, and smart home categories. The corpus text was manually checked to ensure high accuracy.
Description: This dataset is recorded by 452 native Singaporean speakers with a balanced gender distribution. It is rich in content, covering generic command and control, human-machine interaction, smart home command and control, and in-car command and control categories. The transcription corpus has been manually proofread to ensure high accuracy.
Description: 532 Portuguese speakers were recorded speaking authentic English in a relatively quiet environment. The recorded script was designed by linguists and covers a wide range of topics including generic, interactive, on-board, and home. The text was manually proofread with high accuracy. Recording devices are mainstream Android and Apple phones.
Description: 497 Italian speakers were recorded speaking authentic English in a relatively quiet environment. The recorded script was designed by linguists and covers a wide range of topics including generic, interactive, on-board, and home. The text was manually proofread with high accuracy. Recording devices are mainstream Android and Apple phones.
Description: The data volume is 227 hours, recorded by Spanish native speakers from Spain, Mexico, and Venezuela in a quiet environment. The recording content covers various fields such as economy, entertainment, news, and spoken language. All texts are manually transcribed. The sentence accuracy rate is 95%.
Description: This dataset is recorded by 498 native Russian speakers with a balanced gender distribution. It is rich in content, covering generic command and control, human-machine interaction, smart home command and control, and in-car command and control categories. The transcription corpus has been manually proofread to ensure high accuracy.
Description: The data volume is 231 hours, recorded by 406 speakers from France, Canada, and Africa. The recording is in a quiet environment and rich in content, covering various fields such as economics, entertainment, news, and spoken language. All texts are manually transcribed. The sentence accuracy rate is 95%.
Description: The data is 240 hours, recorded by 401 Indian speakers. It is recorded in both quiet and noisy environments, making it more suitable for actual application scenarios. The recording content is rich, covering economics, entertainment, news, spoken language, etc. All texts are manually transcribed with high accuracy. It can be applied to speech recognition, machine translation, and voiceprint recognition.
Description: 1,006 Japanese native speakers participated in the recording, coming from the eastern, western, and Kyushu regions, with the eastern region accounting for the largest proportion. The recording content is rich, and all texts have been manually transcribed with high accuracy.
Description: Mobile-phone-captured audio data of Chinese children, with a total duration of 3,255 hours. The 9,780 speakers are children aged 6 to 12, with accents covering seven dialect areas. The recorded text contains common children's language such as essay stories, numbers, and their interactions in cars, at home, and with voice assistants, precisely matching actual application scenes. All sentences are manually transcribed with high accuracy.
Description: 300 Hours - Tibetan Colloquial Video Speech Data, collected from real websites and covering multiple fields. Various attributes such as text content and speaker identity are annotated. This dataset can be used for voiceprint recognition model training, corpus construction for machine translation, and algorithm research.
Description: This 338-hour Spanish speech dataset is recorded by 800 native Spanish speakers from Spain, Mexico, and Argentina. The recording environment is quiet. All texts are manually transcribed; the sentence accuracy rate is 95%. It can be applied to speech recognition, machine translation, voiceprint recognition, and so on.
Description: This speech data is collected from 343 Spanish native speakers from Spain, Mexico, and Argentina, with 50 sentences per speaker, totaling 9.9 hours. The recording environment is quiet. All texts are manually transcribed with high accuracy. Recording devices are mainstream Android phones and iPhones. It can be used for speech recognition, machine translation, and voiceprint recognition.
Description: Italian-language audio data captured by mobile phone, with a total duration of 347 hours. It is recorded by 800 Italian native speakers with a balanced gender distribution. The recording environment is quiet, and all texts are manually transcribed with high accuracy. This dataset can be applied to automatic speech recognition, machine translation, and voiceprint recognition.
Description: This dataset contains speech data from 349 English speakers, all of whom are native locals. The recording environment is quiet. The recorded content covers many fields such as in-car, home, and voice assistant, with about 50 sentences per person. The valid data amounts to 9.5 hours. All texts are manually transcribed with high accuracy.
Description: Audiobook audio data annotated with pinyin, with a duration of 35 hours. 5 speakers are recorded, including 3 males and 2 females. Chinese characters and pinyin are annotated, including pinyin tones. This dataset can be used for automatic speech recognition, machine translation, and voiceprint recognition.
Description: The data were collected and recorded by 351 German native speakers with authentic accents. Recording devices are mainstream Android phones and iPhones. The recorded text was designed by professional language experts and is rich in content, covering multiple categories such as general-purpose, interactive, vehicle-mounted, and household commands. The recording environment is quiet and echo-free. The texts are manually transcribed with a high accuracy rate.
Description: Italian speech data (guiding) is collected from 351 Italian native speakers and recorded in a quiet environment. The recording is rich in content, covering multiple categories such as in-car scenes, smart home, and speech assistants, with 50 sentences per speaker. The valid volume is 9.8 hours, and each sentence is repeated 2.7 times on average. All texts are manually transcribed with high accuracy.
Description: 357 hours of Korean speech data collected by cellphone, recorded by 999 Korean speakers in a quiet environment and rich in content. All texts are transcribed by professional annotators. The sentence accuracy rate is 95%. It can be used for speech recognition, machine translation, and voiceprint recognition.
Description: 891 Spanish native speakers participated in the recording with authentic accents. The recorded script was designed by linguists and covers a wide range of topics including generic, interactive, on-board, and home. The text was manually proofread with high accuracy. Recording devices are mainstream Android and Apple phones. The dataset can be applied to automatic speech recognition and machine translation scenarios.
Description: The data is recorded by 397 Indian speakers with authentic accents, with 50 sentences per speaker, totaling 8.6 hours. The recording content covers in-car scenes, smart home, and intelligent voice assistants. This data can be used for corpus construction for machine translation, as well as model training and algorithm research for voiceprint recognition.
Description: 401 speakers participated in this recording, with 50 sentences per speaker, totaling 10.9 hours. Recording texts cover in-car scenes, smart home, and smart speech assistants. Texts were manually transcribed and are accurate. Recording devices are mainstream Android phones and iPhones. It can be used for in-car, smart home, and speech assistant scenarios.
Description: Children's read English audio data, covering ages from preschool (3-5 years old) to school age (6-12 years old), with children's speech features. The content accurately matches children's actual scenarios of speaking English. It provides data support for children's smart home applications, automatic speech recognition, and oral assessment in intelligent education.
Description: Recording devices are mainstream Android phones and iPhones.
Description: Thai speech data (guiding) is collected from 490 Thai native speakers and recorded in a quiet environment. The recording is rich in content, covering multiple categories such as in-car scenes, smart home, and speech assistants, with 50 sentences per speaker. The valid volume is 15 hours. All texts are manually transcribed with high accuracy.
Description: 500 Hours - Filipino Speech Data by Mobile Phone. The data were recorded by Filipino speakers with authentic Filipino accents. The text is manually proofread with high accuracy. Recording devices are mainstream Android and Apple phones.
Description: 500 Hours - Indian English Colloquial Video Speech Data, collected from real websites and covering multiple fields. Various attributes such as text content and speaker identity are annotated. This dataset can be used for voiceprint recognition model training, corpus construction for machine translation, and algorithm research.
About 700 speakers participated in the recording, communicating face to face in a natural way. They held free discussions on a number of given topics across a wide range of fields; the speech is natural and fluent, in line with actual dialogue scenes. Texts are manually transcribed with high accuracy.
About 1,000 speakers participated in the recording, communicating face to face in a natural way. They held free discussions on a number of given topics across a wide range of fields; the speech is natural and fluent, in line with actual dialogue scenes. Texts are manually transcribed with high accuracy.
About 700 Korean speakers participated in the recording, communicating face to face in a natural way. They held free discussions on a number of given topics across a wide range of fields; the speech is natural and fluent, in line with actual dialogue scenes. Texts are manually transcribed with high accuracy.
Description: Korean audio data with a duration of 516 hours. Recorded texts include daily language, various interactive sentences, home commands, on-board commands, etc. Of the 1,077 speakers, 49% are male and 51% female. The duration per speaker is around half an hour.
Description: 1,089 French native speakers participated in the recording with authentic accents. The recorded script was designed by linguists and covers a wide range of topics including generic, interactive, on-board, and home. The text was manually proofread with high accuracy. Recording devices are mainstream Android and Apple phones. The dataset can be applied to automatic speech recognition and machine translation scenarios.