UIT-ViCoQA is a new corpus for conversational machine reading comprehension in the Vietnamese language. This corpus consists of 10,000 questions with answers over 2,000 conversations about health news articles.
1 PAPER • NO BENCHMARKS YET
UIT-ViSFD (Vietnamese Smartphone Feedback Dataset) is a new benchmark corpus for evaluating aspect-based sentiment analysis, built using strict annotation schemes. It consists of 11,122 human-annotated comments for mobile e-commerce and is freely available for research purposes.
This dataset is used for spam review detection (opinion spam reviews) on Vietnamese e-commerce websites.
A Vietnamese dataset for target-specific hate speech detection. The dataset contains 10,000 comments; each comment is labeled for 5 targets at three levels of hatefulness.
We introduce a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.
1 PAPER • 2 BENCHMARKS
We introduce the first Vietnamese spelling correction dataset, containing manually labelled mistakes and their corresponding correct words.
VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube, an extensive source of user-uploaded content, covering the topics of food and travel in the Vietnamese language. This dataset is used for research in Vietnamese spoken-based machine reading comprehension.
WEATHub is a dataset covering 24 languages. It contains words organized into groups of (target1, target2, attribute1, attribute2) to measure the association target1:target2 :: attribute1:attribute2. For example, target1 can be insects and target2 can be flowers, and we might measure whether insects or flowers are found pleasant or unpleasant. Word associations are quantified using the WEAT metric in our paper, which calculates an effect size (Cohen's d) and also provides a p-value to measure the statistical significance of the results. In our paper, we use word embeddings from language models to perform these tests and understand biased associations in language models across different languages.
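To make the WEAT measurement concrete, here is a minimal sketch of the commonly used formulation: a cosine-similarity association score per target word, an effect size in the style of Cohen's d, and a one-sided permutation p-value. It is not taken from the WEATHub paper. The function names, the 2-D toy vectors, the use of the sample standard deviation, and the sampled (rather than exhaustive) permutation test are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attr1, attr2):
    """How much closer w is to attribute set 1 than to attribute set 2."""
    return mean(cosine(w, a) for a in attr1) - mean(cosine(w, b) for b in attr2)


def weat_effect_size(target1, target2, attr1, attr2):
    """Difference of mean associations, scaled by the pooled spread.

    Assumption: sample standard deviation over all target words.
    """
    s1 = [association(w, attr1, attr2) for w in target1]
    s2 = [association(w, attr1, attr2) for w in target2]
    return (mean(s1) - mean(s2)) / stdev(s1 + s2)


def weat_p_value(target1, target2, attr1, attr2, n_samples=1000, seed=0):
    """One-sided p-value estimated from random re-partitions of the targets.

    Assumption: Monte Carlo sampling instead of enumerating all partitions.
    """
    rng = random.Random(seed)
    scores = [association(w, attr1, attr2) for w in target1 + target2]
    n1 = len(target1)
    observed = sum(scores[:n1]) - sum(scores[n1:])
    hits = 0
    for _ in range(n_samples):
        perm = scores[:]
        rng.shuffle(perm)
        if sum(perm[:n1]) - sum(perm[n1:]) > observed:
            hits += 1
    return hits / n_samples


# Hypothetical 2-D "embeddings"; real tests would use language-model vectors.
insects = [[0.1, 1.0], [0.2, 0.9]]      # target1
flowers = [[1.0, 0.1], [0.9, 0.2]]      # target2
pleasant = [[1.0, 0.0]]                 # attribute1
unpleasant = [[0.0, 1.0]]               # attribute2

d = weat_effect_size(insects, flowers, pleasant, unpleasant)
p = weat_p_value(insects, flowers, pleasant, unpleasant)
print(f"effect size d = {d:.3f}, p-value = {p:.3f}")
```

With this sign convention, a positive effect size means target1 is more associated with attribute1 than target2 is, and a negative one means the reverse. The toy vectors place insects near the unpleasant direction, so the effect size comes out negative.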
xMIND is an open, large-scale multilingual news dataset for multi- and cross-lingual news recommendation. xMIND is derived from the English MIND dataset using open-source neural machine translation (i.e., NLLB 3.3B).