59 dataset results for Vietnamese

UIT-ViCoQA (Conversational machine reading comprehension in the Vietnamese language)

UIT-ViCoQA is a new corpus for conversational machine reading comprehension in the Vietnamese language. This corpus consists of 10,000 questions with answers over 2,000 conversations about health news articles.

1 PAPER • NO BENCHMARKS YET

UIT-ViSFD

UIT-ViSFD (Vietnamese Aspect-Based Sentiment Analysis Dataset)

UIT-ViSFD (Vietnamese Smartphone Feedback Dataset) is a benchmark corpus for evaluating aspect-based sentiment analysis, built with strict annotation schemes. It consists of 11,122 human-annotated comments from mobile e-commerce and is freely available for research purposes.

1 PAPER • NO BENCHMARKS YET

ViMATH

ViMATH (Vietnamese MATH)

1 PAPER • NO BENCHMARKS YET

ViSR

ViSR (Vietnamese Synthetic Reasoning)

1 PAPER • NO BENCHMARKS YET

ViSpamReviews

ViSpamReviews (Vietnamese Spam Reviews Detection)

This dataset is used for detecting spam reviews (opinion spam) on Vietnamese e-commerce websites.

1 PAPER • NO BENCHMARKS YET

ViTHSD

ViTHSD (Vietnamese Targeted-Hate-Speech-Detection)

A Vietnamese dataset for target-specific hate speech detection. The dataset contains 10,000 comments; each comment is labeled for 5 targets, with three levels of hatefulness for each target.

1 PAPER • NO BENCHMARKS YET

VietMed

VietMed (VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain)

We introduce a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.

1 PAPER • 2 BENCHMARKS

Viwiki-Spelling

Viwiki-Spelling (Vietnamese Spelling Correction Dataset)

We introduce the first Vietnamese spelling correction dataset, containing manually labelled mistakes and the corresponding correct words.

1 PAPER • NO BENCHMARKS YET

VlogQA

VlogQA (Vietnamese Spoken-Based Machine Reading Comprehension)

VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube, an extensive source of user-uploaded content, and covers the topics of food and travel in the Vietnamese language. This dataset is used for research in Vietnamese Spoken-Based Machine Reading Comprehension.

1 PAPER • NO BENCHMARKS YET

WEATHub

WEATHub is a dataset covering 24 languages. It contains words organized into groups of (target1, target2, attribute1, attribute2) to measure the association target1:target2 :: attribute1:attribute2. For example, target1 can be insects and target2 can be flowers, and the test measures whether insects or flowers are associated more strongly with pleasant or unpleasant words. Word associations are quantified using the WEAT metric described in our paper, which calculates an effect size (Cohen's d) and also provides a p-value (to measure the statistical significance of the results). In our paper, we use word embeddings from language models to perform these tests and understand biased associations in language models across different languages.
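As a rough illustration of the effect size mentioned above, the following is a minimal sketch of the commonly used WEAT formulation (difference of mean associations divided by the pooled standard deviation). The function names and the toy 2-dimensional vectors are hypothetical and are not taken from WEATHub or its paper; the authors' implementation may differ, and the permutation test behind the p-value is omitted.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, attr1, attr2):
    """Mean similarity of w to attribute1 minus mean similarity to attribute2."""
    return (np.mean([cosine(w, a) for a in attr1])
            - np.mean([cosine(w, b) for b in attr2]))

def weat_effect_size(target1, target2, attr1, attr2):
    """Cohen's-d style effect size over the two target word sets."""
    s1 = [association(x, attr1, attr2) for x in target1]
    s2 = [association(y, attr1, attr2) for y in target2]
    pooled_std = np.std(s1 + s2, ddof=1)  # sample std over all target words
    return float((np.mean(s1) - np.mean(s2)) / pooled_std)

# Toy embeddings (assumed for illustration only, not real model output).
flowers = [[1.0, 0.0], [0.9, 0.1]]      # target1
insects = [[0.0, 1.0], [0.1, 0.9]]      # target2
pleasant = [[1.0, 0.0]]                 # attribute1
unpleasant = [[0.0, 1.0]]               # attribute2

d = weat_effect_size(flowers, insects, pleasant, unpleasant)
print(f"effect size: {d:.3f}")
```

A positive value here means target1 sits closer to attribute1 than target2 does; swapping the two target sets flips the sign.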

1 PAPER • NO BENCHMARKS YET

xMIND

xMIND (A Multilingual Dataset for Cross-lingual News Recommendation)

xMIND is an open, large-scale multilingual news dataset for multi- and cross-lingual news recommendation. xMIND is derived from the English MIND dataset using open-source neural machine translation (i.e., NLLB 3.3B).

1 PAPER • NO BENCHMARKS YET
