🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task

Filter by Language (clear)

57 dataset results for Vietnamese

The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The first version of the dataset was released in 2015 and consisted of 10 treebanks over 10 languages. Version 2.7 released in 2020 consists of 183 treebanks over 104 languages. The annotation consists of UPOS (universal part-of-speech tags), XPOS (language-specific part-of-speech tags), Feats (universal morphological features), Lemmas, dependency heads and universal dependency labels.

505 PAPERS • 12 BENCHMARKS

XNLI (Cross-lingual Natural Language Inference)

The Cross-lingual Natural Language Inference (XNLI) corpus is the extension of the Multi-Genre NLI (MultiNLI) corpus to 15 languages. The dataset was created by manually translating the validation and test sets of MultiNLI into each of those 15 languages. The English training set was machine translated for all languages. The dataset is composed of 122k train, 2490 validation and 5010 test examples.

328 PAPERS • 10 BENCHMARKS

Common Voice

Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.

314 PAPERS • 164 BENCHMARKS

MuST-C

MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian and Russian. The corpus consists of audio, transcriptions and translations of English TED talks, and it comes with a predefined training, validation and test split.

194 PAPERS • 2 BENCHMARKS

XQuAD

XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently, the dataset is entirely parallel across 11 languages.

170 PAPERS • 1 BENCHMARK

MLQA (MultiLingual Question Answering)

MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between 4 different languages on average.

151 PAPERS • 1 BENCHMARK

CC100

This corpus comprises of monolingual data for 100+ languages and also includes data for romanized languages. This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises of documents separated by double-newlines and paragraphs within the same document separated by a newline. The data is generated using the open source CC-Net repository.

96 PAPERS • NO BENCHMARKS YET

WikiANN

WikiANN (PAN-X)

WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format¹². This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.

57 PAPERS • 3 BENCHMARKS

OSCAR

OSCAR or Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text.

55 PAPERS • NO BENCHMARKS YET

ICDAR 2015

ICDAR 2015 was a scene text detection used for the ICDAR 2015 conference.

51 PAPERS • 2 BENCHMARKS

WikiLingua

WikiLingua includes ~770k article and summary pairs in 18 languages from WikiHow. Gold-standard article-summary alignments across languages are extracted by aligning the images that are used to describe each how-to step in an article.

50 PAPERS • 5 BENCHMARKS

XL-Sum

XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.

41 PAPERS • NO BENCHMARKS YET

MKQA (Multilingual Knowledge Questions and Answers)

Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.

37 PAPERS • NO BENCHMARKS YET

IGLUE (Image-Grounded Language Understanding Evaluation)

The Image-Grounded Language Understanding Evaluation (IGLUE) benchmark brings together—by both aggregating pre-existing datasets and creating new ones—visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. The benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

21 PAPERS • 13 BENCHMARKS

XGLUE

XGLUE is an evaluation benchmark XGLUE,which is composed of 11 tasks that span 19 languages. For each task, the training data is only available in English. This means that to succeed at XGLUE, a model must have a strong zero-shot cross-lingual transfer capability to learn from the English data of a specific task and transfer what it learned to other languages. Comparing to its concurrent work XTREME, XGLUE has two characteristics: First, it includes cross-lingual NLU and cross-lingual NLG tasks at the same time; Second, besides including 5 existing cross-lingual tasks (i.e. NER, POS, MLQA, PAWS-X and XNLI), XGLUE selects 6 new tasks from Bing scenarios as well, including News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG). Such diversities of languages, tasks and task origin provide a comprehensive benchmark for quantifying the quality of a pre-trained model on cross-lingual natural lan

20 PAPERS • 2 BENCHMARKS

Belebele

Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.

17 PAPERS • NO BENCHMARKS YET

OASST1

OASST1 (OpenAssistant Conversations Dataset)

license: apache-2.0 tags: human-feedback size_categories: 100K<n<1M pretty_name: OpenAssistant Conversations

14 PAPERS • NO BENCHMARKS YET

UIT-ViQuAD

A new dataset for the low-resource language as Vietnamese to evaluate MRC models. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.

13 PAPERS • 1 BENCHMARK

Synbols

Synbols is a dataset generator designed for probing the behavior of learning algorithms. By defining the distribution over latent factors one can craft a dataset specifically tailored to answer specific questions about a given algorithm.

11 PAPERS • NO BENCHMARKS YET

Duolingo STAPLE Shared Task

This is the dataset for the 2020 Duolingo shared task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Sentence prompts, along with automatic translations, and high-coverage sets of translation paraphrases weighted by user response are provided in 5 language pairs. Starter code for this task can be found here: github.com/duolingo/duolingo-sharedtask-2020/. More details on the data set and task are available at: sharedtask.duolingo.com

10 PAPERS • NO BENCHMARKS YET

ViHSD

ViHSD (Vietnamese Hate Speech Detection Dataset)

This dataset contains 33,400 annotated comments used for hate speech detection on social network sites. Label: CLEAN (non hate), OFFENSIVE and HATE

9 PAPERS • NO BENCHMARKS YET

UIT-ViCTSD

UIT-ViCTSD (UIT Vietnamese Constructive and Toxic Speech Detection)

UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection) is a dataset for constructive and toxic speech detection in Vietnamese. It consists of 10,000 human-annotated comments.

7 PAPERS • NO BENCHMARKS YET

VNHSGE

VNHSGE (VietNamese High School Graduation Examination Dataset for Large Language Models)

The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. 300 literary essays have been included, and there are over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students to see how well they performed. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have space to grow, t

7 PAPERS • 9 BENCHMARKS

PhoMT

PhoMT is a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs for machine translation.

6 PAPERS • NO BENCHMARKS YET

VIVOS (VIVOS Corpus)

VIVOS is a free Vietnamese speech corpus consisting of 15 hours of recording speech prepared for Automatic Speech Recognition task.

6 PAPERS • 1 BENCHMARK

UIT-ViIC

UIT-ViIC contains manually written captions for images from Microsoft COCO dataset relating to sports played with ball. UIT-ViIC consists of 19,250 Vietnamese captions for 3,850 images.

5 PAPERS • NO BENCHMARKS YET

ATIS (vi)

ATIS (vi) (Vietnamese Intent Detection and Slot Filling)

This is a dataset for intent detection and slot filling for the Vietnamese language. The dataset consists of 5,871 gold annotated utterances with 28 intent labels and 82 slot types.

4 PAPERS • 3 BENCHMARKS

UIT-VSMEC

UIT-VSMEC (Vietnamese Social Media Emotion Corpus)

Emotion recognition is a higher approach or special case of sentiment analysis. In this task, the result is not produced in terms of either polarity: positive or negative or in the form of rating (from 1 to 5) but of a more detailed level of sentiment analysis in which the result are depicted in more expressions like sadness, enjoyment, anger, disgust, fear and surprise. Emotion recognition plays a critical role in measuring brand value of a product by recognizing specific emotions of customers’ comments. In this study, we have achieved two targets. First and foremost, we built a standard Vietnamese Social Media Emotion Corpus (UIT-VSMEC) with about 6,927 human-annotated sentences with six emotion labels, contributing to emotion recognition research in Vietnamese which is a low-resource language in Natural Language Processing (NLP). Secondly, we assessed and measured machine learning and deep neural network models on our UIT-VSMEC. As a result, Convolutional Neural Network (CNN) model

4 PAPERS • NO BENCHMARKS YET

UIT-ViNewsQA

UIT-ViNewsQA is a new corpus for the Vietnamese language to evaluate healthcare reading comprehension models. The corpus comprises 22,057 human-generated question-answer pairs. Crowd-workers create the questions and their answers based on a collection of over 4,416 online Vietnamese healthcare news articles, where the answers comprise spans extracted from the corresponding articles.

4 PAPERS • NO BENCHMARKS YET

PhoNER COVID19

PhoNER_COVID19 is a dataset for recognising COVID-19 related named entities in Vietnamese, consisting of 35K entities over 10K sentences. The authors defined 10 entity types with the aim of extracting key information related to COVID-19 patients, which are especially useful in downstream applications. In general, these entity types can be used in the context of not only the COVID-19 pandemic but also in other future epidemics.

3 PAPERS • 1 BENCHMARK

UIT-EVJVQA (English-Japanese-Vietnamese Visual Question Answering)

UIT-EVJVQA, the first multilingual Visual Question Answering dataset with three languages: English, Vietnamese, and Japanese, is released in this task. UIT-EVJVQA includes question-answer pairs created by humans on a set of images taken in Vietnam, with the answer created from the input question and the corresponding image. UIT-EVJVQA consists of 33,000+ question-answer pairs for evaluating the mQA models.

3 PAPERS • NO BENCHMARKS YET

UIT-ViCoV19QA

The dataset comprises 4,500 question-answer pairs collected from trusted medical sources, with at least one answer and at most four unique paraphrased answers per question

3 PAPERS • NO BENCHMARKS YET

UIT-ViWikiQA

The UIT-ViWikiQA is a dataset for evaluating sentence extraction-based machine reading comprehension in the Vietnamese language. The UIT-ViWikiQA dataset is converted from the UIT-ViQuAD dataset, consisting of 23,074 question-answers based on 5,109 passages of 174 Vietnamese articles from Wikipedia.

3 PAPERS • NO BENCHMARKS YET

ViMQ

ViMQ is a Vietnamese dataset of medical questions from patients with sentence-level and entity-level annotations for the Intent Classification and Named Entity Recognition tasks. It contains Vietnamese medical questions crawled from the consultation section online between patients and doctors from www.vinmec.com, a website of a Vietnamese general hospital. Each consultation consists of a question regarding a specific health issue of a patient and a detailed respond provided by a clinical expert. The dataset contains health issues that fall into a wide range of categories including common illness, cardiology, hematology, cancer, pediatrics, etc. We removed sections where users ask about information of the hospital and selected 9,000 questions for the dataset.

3 PAPERS • NO BENCHMARKS YET

MultiSpider

MultiSpider is a large multilingual text-to-SQL dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).

2 PAPERS • NO BENCHMARKS YET

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: - Vietnamese - Romanian - Latvian - Czech - Polish - Slovak - Irish - Hungarian - French - Turkish - Spanish - Croatian

2 PAPERS • 12 BENCHMARKS

OpenViVQA (Open-domain Visual Question Answering in Vietnamese)

In recent years, visual question answering (VQA) has attracted attention from the research community because of its highly potential applications (such as virtual assistance on intelligent cars, assistant devices for blind people, or information retrieval from document images using natural language as queries) and challenge. The VQA task requires methods that have the ability to fuse the information from questions and images to produce appropriate answers. Neural visual question answering models have achieved tremendous growth on large-scale datasets which are mostly for resource-rich languages such as English. However, available datasets narrow the VQA task as the answers selection task or answer classification task. We argue that this form of VQA is far from human ability and eliminates the challenge of the answering aspect in the VQA task by just selecting answers rather than generating them. In this paper, we introduce the OpenViVQA (Open-domain Vietnamese Visual Question Answering

2 PAPERS • NO BENCHMARKS YET

TyDiP (A Dataset for Politeness Classification in Nine Typologically Diverse Languages)

A Dataset for Politeness Classification in Nine Typologically Diverse Languages (TyDiP) is a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples.

2 PAPERS • NO BENCHMARKS YET

VNDS

VNDS (VNDS: A Vietnamese Dataset for Summarization)

A single-document Vietnamese summarization dataset

2 PAPERS • 1 BENCHMARK

ViNLI

ViNLI (Vietnamese Natural Language Inference Dataset)

A large-scale and high-quality corpus is necessary for studies on NLI for Vietnamese, which can be considered a low-resource language. In this paper, we introduce ViNLI (Vietnamese Natural Language Inference), an open-domain and high-quality corpus for evaluating Vietnamese NLI models, which is created and evaluated with a strict process of quality control. ViNLI comprises over 30,000 human-annotated premise-hypothesis sentence pairs extracted from more than 800 online news articles on 13 distinct topics.

2 PAPERS • NO BENCHMARKS YET

ViText2SQL

ViText2SQL is a dataset for the Vietnamese Text-to-SQL semantic parsing task, consisting of about 10K question and SQL query pairs.

2 PAPERS • NO BENCHMARKS YET

AISIA-VN-Review-S

AISIA-VN-Review-S (AISIA-VN-Review-F)

In AISIA-VN-Review-S and AISIA-VN-Review-F datasets, we first collect 450K customer reviewing comments from various e–commerce websites. Then, we manually label each review to be either positive or negative, resulting in 358,743 positive reviews and 100,699 negative reviews. We named this dataset the sentiment classification from reviews collected by AISIA, the full version (AISIA-VN-Review-F). However, in this work, we are interested in improving the model’s performance when the training data are limited; thus, we only consider a subset of up to 25K training reviews and evaluate the model on another 170K reviews. We refer to this subset from the full dataset as AISIA-VN-Review-S. It is important to emphasize that our team spends a lot of time and effort to manually classify each review into positive or negative sentiments.

1 PAPER • NO BENCHMARKS YET

DivEMT (Post-Editing Effort Across Typologically-diverse Languages)

DivEMT, the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages. Using a strictly controlled setup, 18 professional translators were instructed to translate or post-edit the same set of English documents into Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese. During the process, their edits, keystrokes, editing times and pauses were recorded, enabling an in-depth, cross-lingual evaluation of NMT quality and post-editing effectiveness. Using this new dataset, we assess the impact of two state-of-the-art NMT systems, Google Translate and the multilingual mBART-50 model, on translation productivity.

1 PAPER • NO BENCHMARKS YET

MGSM8KInstruct

MGSM8KInstruct, the multilingual math reasoning instruction dataset, encompassing ten distinct languages, thus addressing the issue of training data scarcity in multilingual math reasoning.

1 PAPER • NO BENCHMARKS YET

MVALUE

MVALUE (Multilingual human VALUE dataset)

Multilingual human VALUE(MVALUE) is a multilingual dataset covering 7 concepts of human values: morality, deontology, utilitarianism, fairness, truthfulness, toxicity and harmfulness, each concept subset of it includes positive and negative texts that represent the two opposing directions of the concept. We performed translation on collected human value datasets from English into 15 non-English languages using Google Translate. These languages belong to various language families, including Indo-European (Catalan, French, Indonesian, Portuguese, Spanish), NigerCongo (Chichewa, Swahili), Dravidian (Tamil, Telugu), Uralic (Finnish, Hungarian), Sino-Tibetan (Chinese), Japonic (Japanese), Koreanic (Korean) and Austro-Asiatic (Vietnamese).

1 PAPER • NO BENCHMARKS YET

Synthetic Reasoning - Natural Language

Synthetic Reasoning - Natural Language (Vietnamese Synthetic Reasoning - Natural Language)

Click to add a brief description of the dataset (Markdown and LaTeX enabled).

1 PAPER • NO BENCHMARKS YET

UIT-ViCoQA

UIT-ViCoQA (Conversational machine reading comprehension in the Vietnamese language)

UIT-ViCoQA is a new corpus for conversational machine reading comprehension in the Vietnamese language. This corpus consists of 10,000 questions with answers over 2,000 conversations about health news articles.

1 PAPER • NO BENCHMARKS YET

UIT-ViSFD

UIT-ViSFD (Vietnamese Aspect-Based Sentiment Analysis Dataset)

UIT-ViSFD is a Vietnamese Smartphone Feedback Dataset as a new benchmark corpus built based on strict annotation schemes for evaluating aspect-based sentiment analysis, consisting of 11,122 human-annotated comments for mobile e-commerce, which is freely available for research purposes.

1 PAPER • NO BENCHMARKS YET

Datasets

57 dataset results for Vietnamese