ArSarcasm-v2 is an extension of the original ArSarcasm dataset, published along with the paper From Arabic Sentiment Analysis to Sarcasm Detection: The ArSarcasm Dataset. ArSarcasm-v2 consists of ArSarcasm together with portions of the DAICT corpus and some new tweets. Each tweet was annotated for sarcasm, sentiment, and dialect. The final dataset consists of 15,548 tweets, divided into 12,548 training tweets and 3,000 testing tweets. ArSarcasm-v2 was used and released as part of the shared task on sarcasm detection and sentiment analysis in Arabic.
14 PAPERS • NO BENCHMARKS YET
ArSarcasm is an Arabic sarcasm detection dataset. It was created from previously available Arabic sentiment analysis datasets (SemEval 2017 and ASTD), adding sarcasm and dialect labels to them. The dataset contains 10,547 tweets, 1,682 (16%) of which are sarcastic.
11 PAPERS • NO BENCHMARKS YET
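The annotation scheme shared by both ArSarcasm releases (a sarcasm flag plus sentiment and dialect labels per tweet) can be sketched with a few toy records; the field names and label values below are illustrative assumptions, not the official column headers of the released files:

```python
from collections import Counter

# Hypothetical per-tweet records following the annotation scheme described
# above (sarcasm, sentiment, dialect); field names are illustrative only.
tweets = [
    {"text": "...", "sarcasm": True,  "sentiment": "negative", "dialect": "egypt"},
    {"text": "...", "sarcasm": False, "sentiment": "positive", "dialect": "msa"},
    {"text": "...", "sarcasm": False, "sentiment": "neutral",  "dialect": "gulf"},
    {"text": "...", "sarcasm": True,  "sentiment": "negative", "dialect": "levant"},
]

# Class-balance checks of the kind reported for the full corpora
# (e.g. ~16% sarcastic tweets in ArSarcasm v1).
sarcastic = sum(t["sarcasm"] for t in tweets)
rate = sarcastic / len(tweets)
dialects = Counter(t["dialect"] for t in tweets)
print(sarcastic, f"{rate:.2f}", dialects["egypt"])
```

On the real data the same counts would be computed over the 12,548-tweet training split rather than this four-row toy sample.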
FreCDo is a corpus for French dialect identification comprising 413,522 French text samples collected from public news websites in Belgium, Canada, France and Switzerland.
1 PAPER • NO BENCHMARKS YET
Description: 2,284 native speakers of the Kunming dialect participated in the recording, with authentic accents and from multiple age groups. The recorded script covers a wide range of topics, including generic, interactive, in-car, and home scenarios. Kunming locals participated in quality checking and proofreading, and the text was transcribed accurately. Recordings were made on mainstream Android and Apple phones.
0 PAPERS • NO BENCHMARKS YET
Description: 2,000 Changsha natives participated in the recording, covering multiple age groups, with a balanced gender distribution and authentic accents. The recorded text is rich in content, covering general, interactive, in-car, home, and other categories. Changsha locals checked and proofread the data; sentence accuracy is 95%. It is mainly applied to speech recognition, machine translation, and voiceprint recognition.
Description: It collects 463 Henan locals with authentic accents. The recording contents cover daily messages and customer consultations from multiple fields. The data is checked and proofread by Henan locals to ensure high accuracy, and was recorded on Android phones and iPhones.
Description: 300 Hours - Tibetan Colloquial Video Speech Data, collected from real-world websites and covering multiple fields. Attributes such as text content and speaker identity are annotated. This dataset can be used for training voiceprint recognition models, building machine translation corpora, and algorithm research.
Description: 500 Hours - Kazakh Colloquial Video Speech Data, collected from real-world websites and covering multiple fields. Attributes such as text content and speaker identity are annotated. This dataset can be used for training voiceprint recognition models, building machine translation corpora, and algorithm research.
Description: 505 Hours - Uyghur Colloquial Video Speech Data, collected from real-world websites and covering multiple fields. Attributes such as text content and speaker identity are annotated. This dataset can be used for training voiceprint recognition models, building machine translation corpora, and algorithm research.
Description: It collects 2,034 Chinese speakers from 26 provinces, including Henan, Shanxi, Sichuan, Hunan, and Fujian. It is Mandarin speech data with heavy accents. The recording contents cover finance and economics, entertainment, policy, news, TV, and movies.
Description: It collects 312 speakers from the Northeast region, all reading texts in the Northeastern dialect. The recording contents cover customer consultations and message texts from nearly 30 fields. Sentences are manually transcribed and proofread by professional annotators, with high accuracy.
Description: It collects 2,507 speakers from the Sichuan Basin, recorded in a quiet indoor environment. The recorded content covers customer consultations and text messages in many fields. The average number of repetitions is 1.3 and the average sentence length is 12.5 words. Sichuan natives participated in quality inspection and proofreading to ensure accurate text transcription.
Description: 1,730 Sichuan native speakers participated in the recording, talking freely face-to-face in a natural way across a wide range of fields, with no topic specified. The speech is natural and fluent, in line with real dialogue scenes. The speech was transcribed into text manually to ensure high accuracy.
Description: Mobile-phone-captured audio data of the Wuhan dialect, 997 hours in total, recorded by more than 2,000 native Wuhan dialect speakers. The recorded text covers generic, interactive, in-car, home, and other categories, with rich content. Wuhan locals participated in quality checking and proofreading; sentence accuracy reaches 95%. This dataset can be used for automatic speech recognition, machine translation, and voiceprint recognition.
It collects 4,888 speakers from Guangdong Province, recorded in a quiet indoor environment. The recorded content covers 500,000 commonly used spoken sentences, including high-frequency Weibo words and everyday expressions. The average number of repetitions is 1.5 and the average sentence length is 12.5 words. Recording devices are mainstream Android phones and iPhones.
It collects 2,956 speakers from Shanghai, recorded in a quiet indoor environment. The recorded content includes multi-domain customer consultations, short messages, numbers, Shanghai POIs, etc. The corpus has no repetition and the average sentence length is 12.68 words. Recording devices are mainstream Android phones and iPhones.