324 dataset results for Chinese

The Hume Vocal Burst Database (H-VB) includes all train, validation, and test recordings and corresponding emotion ratings for the train and validation recordings.

3 PAPERS • 7 BENCHMARKS

ImageNet_CN

ImageNet_CN (Chinese ImageNet Classification)

Transforms the ImageNet-1K classification dataset for Chinese models by translating labels and prompts into Chinese.

3 PAPERS • 1 BENCHMARK

LEVEN

LEVEN (Legal Event Detection Dataset)

LEVEN is the largest Legal Event Detection dataset as well as the largest Chinese Event Detection dataset.

3 PAPERS • NO BENCHMARKS YET

MISP2021 (Multimodal Information Based Speech Processing 2021)

The MISP2021 challenge dataset is a collection of audio-visual conversational data recorded in a home TV scenario using distant multi-microphones. The dataset captures interactions between several individuals who are engaged in conversations in Chinese while watching TV and interacting with a smart speaker/TV in a living room. The dataset is extensive, comprising 141 hours of audio and video data, which were collected using far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. Notably, this corpus is the first of its kind to offer a distant multimicrophone conversational Chinese audio-visual dataset. Furthermore, it is also the first large vocabulary continuous Chinese lip-reading dataset specifically designed for the adverse home-TV scenario.

3 PAPERS • NO BENCHMARKS YET

MMChat

A large-scale Chinese multi-modal dialogue corpus (120.84K dialogues and 198.82K images). MMChat contains image-grounded dialogues collected from real conversations on social media. We manually annotate 100K dialogues from MMChat with the dialogue quality and whether the dialogues are related to the given image. We provide the rule-filtered raw dialogues that are used to create MMChat (Rule Filtered Raw MMChat), containing 4.257M dialogue sessions and 4.874M images. We also provide a version of MMChat that is filtered based on LCCC (LCCC Filtered MMChat). This version contains much cleaner dialogues (492.6K dialogue sessions and 1.066M images).

3 PAPERS • NO BENCHMARKS YET

ODSQA

ODSQA (Open-Domain Spoken Question Answering)

The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.

3 PAPERS • NO BENCHMARKS YET

OpenLane-V2 test

OpenLane-V2 is the world's first perception and reasoning benchmark for scene structure in autonomous driving. The primary task of the dataset is scene structure perception and reasoning, which requires the model to recognize the dynamic drivable states of lanes in the surrounding environment. The challenges of this dataset include not only detecting lane centerlines and traffic elements but also recognizing the attributes of traffic elements and the topology relationships among detected objects.

3 PAPERS • 1 BENCHMARK

SWSR

SWSR (Sina Weibo Sexism Review)

The Sina Weibo Sexism Review (SWSR) dataset is a dataset to research online sexism in Chinese. The SWSR dataset provides labels at different levels of granularity including (i) sexism or non-sexism, (ii) sexism category and (iii) target type, which can be exploited, among others, for building computational methods to identify and investigate finer-grained gender-related abusive language.

3 PAPERS • NO BENCHMARKS YET

Title2Event

Title2Event is a large-scale sentence-level dataset for benchmarking Open Event Extraction without restricting event types. Title2Event contains more than 42,000 news titles in 34 topics collected from Chinese web pages.

3 PAPERS • NO BENCHMARKS YET

WDC-Dialogue

WDC-Dialogue is a dataset built from Chinese social media to train EVA. Specifically, conversations from various sources are gathered and a rigorous data cleaning pipeline is designed to ensure the quality of WDC-Dialogue.

3 PAPERS • NO BENCHMARKS YET

WeiboPolls

WeiboPolls is a dataset of social media polls collected from Weibo, a popular Chinese microblogging platform. The dataset is intended for empirically studying social media polls and analyzing user engagement patterns.

3 PAPERS • 3 BENCHMARKS

Wikipedia Title

Wikipedia Title is a dataset for learning character-level compositionality from the visual characteristics of characters. It consists of a collection of Wikipedia titles in Chinese, Japanese or Korean labelled with the category to which the article belongs.

3 PAPERS • NO BENCHMARKS YET

XWINO

XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense reasoning capabilities.

3 PAPERS • 1 BENCHMARK

mTVR

mTVR is a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips. The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles. Compared to existing moment retrieval datasets, mTVR is multilingual, larger, and comes with diverse annotations.

3 PAPERS • NO BENCHMARKS YET

Apolloscape Trajectory

Our trajectory dataset consists of camera-based images, LiDAR scanned point clouds, and manually annotated trajectories. It is collected under various lighting conditions and traffic densities in Beijing, China. More specifically, it contains highly complicated traffic flows mixed with vehicles, riders, and pedestrians.

2 PAPERS • 1 BENCHMARK

CA4P-483

CA4P-483 is a dataset designed to facilitate sequence labeling tasks and regulation compliance identification between privacy policies and software. It contains 483 Chinese Android application privacy policies, over 11K sentences, and 52K fine-grained annotations.

2 PAPERS • NO BENCHMARKS YET

CSCD-IME

CSCD-IME is a Chinese Spelling Correction Dataset for errors generated by pinyin IME, containing 40,000 annotated sentences from real posts of official media on Sina Weibo. It is designed to detect and correct spelling mistakes in Chinese texts.

2 PAPERS • NO BENCHMARKS YET

Chinese Classifier

Classifiers are function words that are used to express quantities in Chinese and are especially difficult for language learners. This dataset of Chinese Classifiers can be used to predict Chinese classifiers from context. The dataset contains a large collection of example sentences for Chinese classifier usage derived from three language corpora (Lancaster Corpus of Mandarin Chinese, UCLA Corpus of Written Chinese and Leiden Weibo Corpus). The data was cleaned and processed for a context-based classifier prediction task.

2 PAPERS • NO BENCHMARKS YET

Chinese Gigaword

The Chinese Gigaword corpus consists of 2.2M headline-document pairs of news stories covering over 284 months from two Chinese newspapers, namely the Xinhua News Agency of China (XIN) and the Central News Agency of Taiwan (CNA).

2 PAPERS • NO BENCHMARKS YET

DialogUSR

The DialogUSR dataset covers 23 domains and was built with a multi-step crowd-sourcing procedure. Its queries average 36.7 Chinese characters, each assembled from an average of 3.6 single-intent queries (including initial and follow-up queries). It is designed for the dialogue utterance splitting and reformulation task.

2 PAPERS • NO BENCHMARKS YET

ExpMRC

ExpMRC is a benchmark for the Explainability evaluation of Machine Reading Comprehension. ExpMRC contains four subsets of popular MRC datasets with additionally annotated evidence, including SQuAD, CMRC 2018, RACE+ (similar to RACE), and C3, covering span-extraction and multiple-choice MRC tasks in both English and Chinese.

2 PAPERS • 4 BENCHMARKS

FinVis

Pretrain: 200k Instruction: 100k

2 PAPERS • NO BENCHMARKS YET

GBUSV (Gallbladder Ultrasound Videos)

GBUSV is an unannotated dataset consisting of ultrasound videos of patients with either a malignant or a non-malignant gallbladder. The ultrasound videos were obtained from patients referred to the radiology department of PGIMER, Chandigarh (a high-input hospital in Northern India) for abdominal ultrasound examinations of suspected gallbladder pathologies. Patients had fasted for at least 6 hours. A 1-5 MHz curved array transducer (C-1-5D, Logiq S8, GE Healthcare) was used. The scanning was intended to include the entire gallbladder and the lesion or pathology. The length of the video sequences varies from 43 to 888 frames. The dataset consists of 32 malignant and 32 non-malignant videos containing a total of 12,251 and 3,549 frames, respectively. The video frames are cropped from the center to anonymize the patient information and annotations. The processed frames are 360x480 pixels.

2 PAPERS • NO BENCHMARKS YET

Hansel

Hansel is a human-annotated Chinese entity linking (EL) dataset, focusing on tail entities and emerging entities.

2 PAPERS • NO BENCHMARKS YET

K-SportsSum

K-SportsSum is a sports game summarization dataset with two characteristics: (1) K-SportsSum collects a large amount of data from massive games. It has 7,854 commentary-news pairs. To improve the quality, K-SportsSum employs a manual cleaning process; (2) Different from existing datasets, to narrow the knowledge gap, K-SportsSum further provides a large-scale knowledge corpus that contains the information of 523 sports teams and 14,724 sports players.

2 PAPERS • NO BENCHMARKS YET

LIS (low-light instance segmentation)

To reveal and systematically investigate the effectiveness of the proposed method in the real world, a real low-light image dataset for instance segmentation is needed. Since no suitable dataset exists, we collect and annotate a Low-light Instance Segmentation (LIS) dataset using a Canon EOS 5D Mark IV camera.

2 PAPERS • NO BENCHMARKS YET

MCSCSet

MCSCSet is a large-scale specialist-annotated dataset of about 200k samples, designed for the task of Medical-domain Chinese Spelling Correction. MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, ii) corresponding misspelled sentences manually annotated by medical specialists.

2 PAPERS • NO BENCHMARKS YET

MSDA (Multi-source domain adaptation dataset for text recognition)

Covers five domains (synthetic, document, street view, handwritten, and car license) with over five million images.

2 PAPERS • 2 BENCHMARKS

MultiSpider

MultiSpider is a large multilingual text-to-SQL dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).

2 PAPERS • NO BENCHMARKS YET

OIR

OIR is a financial-domain dataset of the outbound intent recognition task. It aims to identify the intent of customer response in the outbound call scenario.

2 PAPERS • NO BENCHMARKS YET

PETCI

PETCI (A Parallel English Translation Dataset of Chinese Idioms)

PETCI is a Parallel English Translation dataset of Chinese Idioms, collected from an idiom dictionary and Google and DeepL translation. PETCI contains 4,310 Chinese idioms with 29,936 English translations. These translations capture diverse translation errors and paraphrase strategies.

2 PAPERS • NO BENCHMARKS YET

PKU (License Plate Detection)

The PKU dataset has almost 4,000 images categorized into five groups (G1-G5) that show different situations. For example, G1 has images of highways during the day with only one car in them. On the other hand, G5 has images of crosswalks during the day or at night with multiple cars and license plates (LPs).

2 PAPERS • NO BENCHMARKS YET

Parallel Meaning Bank

The Parallel Meaning Bank (PMB), developed at the University of Groningen and building upon the Groningen Meaning Bank, comprises sentences and texts in raw and tokenised format, syntactic analysis, word senses, thematic roles, reference resolution, and formal meaning representations. The main objective of the PMB is to provide fine-grained meaning representations for words, sentences and texts. Sentences are, in isolation, often ambiguous. The aim is to provide the most likely interpretation for a sentence, with a minimal use of underspecification.

2 PAPERS • NO BENCHMARKS YET

ProSLU

ProSLU (Profile-based Spoken Language Understanding)

To bridge the research gap, we propose a new and important task, Profile-based Spoken Language Understanding (ProSLU), which requires a model to rely not only on the text but also on the given supporting profile information. We further introduce a Chinese human-annotated dataset with over 5K utterances annotated with intents and slots, together with the corresponding supporting profile information. In total, we provide three types of supporting profile information: (1) Knowledge Graph (KG), which consists of entities with rich attributes; (2) User Profile (UP), which is composed of user settings and information; and (3) Context Awareness (CA), which covers user state and environmental information.

2 PAPERS • 3 BENCHMARKS

Real 3D-AD

Real 3D-AD is the first point cloud anomaly detection dataset for industrial products. Real3D-AD comprises a total of 1,254 samples that are distributed across 12 distinct categories. These categories include Airplane, Car, Candybar, Chicken, Diamond, Duck, Fish, Gemstone, Seahorse, Shell, Starfish, and Toffees. Each training sample is a realistic, high-accuracy prototype free of blind spots.

2 PAPERS • 1 BENCHMARK

SSD

SSD (Sub-Slot Dialogue dataset)

SSD (Sub-slot Dialog) dataset: This is the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".

2 PAPERS • NO BENCHMARKS YET

SSD_PHONE

SSD_PHONE (Sub-Slot Dialogue dataset phone domain)

SSD (Sub-slot Dialog) dataset: This is the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".

2 PAPERS • NO BENCHMARKS YET

ShopSign

A newly developed natural scene text dataset of Chinese shop signs in street views.

2 PAPERS • NO BENCHMARKS YET

TE141K

A new text effects dataset with 141,081 text effect/glyph pairs in total. The dataset consists of 152 professionally designed text effects rendered on glyphs, including English letters, Chinese characters, and Arabic numerals.

2 PAPERS • NO BENCHMARKS YET

TextBox 2.0

TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.

2 PAPERS • NO BENCHMARKS YET

VGaokao

VGaokao is a verification style reading comprehension dataset designed for native speakers' evaluation.

2 PAPERS • NO BENCHMARKS YET

Weibo-COV

Weibo-COV is a large-scale COVID-19 social media dataset from Weibo, covering more than 30 million posts from 1 November 2019 to 30 April 2020. The dataset's fields are also rich, including basic post information, interaction information, location information, and the retweet network.

2 PAPERS • NO BENCHMARKS YET

XL-R2R

XL-R2R (Cross-lingual Room-to-Room)

The XL-R2R dataset is built upon the R2R dataset and extends it with Chinese instructions. XL-R2R preserves the same splits as in R2R and thus consists of train, val-seen, and val-unseen splits with both English and Chinese instructions, and test split with English instructions only.

2 PAPERS • NO BENCHMARKS YET

Youku-mPLUG

Youku-mPLUG is a large-scale, high-quality Chinese video-language dataset collected from Youku.com, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and quality. It contains 10 million video-text pairs for pre-training and 0.3 million videos for downstream benchmarks covering Video-Text Retrieval, Video Captioning and Video Category Classification.

2 PAPERS • NO BENCHMARKS YET

formalgeo7k

6,981 SAT-level geometry problems with complete natural language descriptions, geometric shapes, formal language annotations, and theorem sequence annotations.

2 PAPERS • NO BENCHMARKS YET

AQL-22

AQL-22 (Archive Query Log)

The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. The AQL is the first publicly available query log that combines size, scope, and diversity, enabling research on new retrieval models and search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.

1 PAPER • NO BENCHMARKS YET

ASR-RAMC-BIGCCSC: A CHINESE CONVERSATIONAL SPEECH CORPUS

A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset, comprising 180 hours of Mandarin Chinese dialogue: 150, 10, and 20 hours for the training, development, and test sets, respectively. It contains 351 multi-turn dialogues, each of which is a coherent and compact conversation centered around one theme.

1 PAPER • NO BENCHMARKS YET

Analysing state-backed propaganda websites: a new dataset and linguistic study

This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish.

1 PAPER • NO BENCHMARKS YET
