We construct a dataset named CPED from 40 Chinese TV shows. CPED consists of multisource knowledge related to empathy and personal characteristics. This knowledge covers 13 emotions, gender, Big Five personality traits, 19 dialogue acts, and other knowledge.
15 PAPERS • 3 BENCHMARKS
Contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues).
13 PAPERS • NO BENCHMARKS YET
PersonalDialog is a large-scale multi-turn dialogue dataset containing various traits from a large number of speakers. The dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers. Each utterance is associated with a speaker who is marked with traits like Age, Gender, Location, Interest Tags, etc. Several anonymization schemes are designed to protect the privacy of each speaker.
9 PAPERS • NO BENCHMARKS YET
KaMed is a knowledge-aware medical dialogue dataset, which contains over 60,000 medical dialogue sessions with 5,682 entities (such as Asthma and Atropine).
6 PAPERS • NO BENCHMARKS YET
MMChat is a large-scale Chinese multi-modal dialogue corpus (120.84K dialogues and 198.82K images). It contains image-grounded dialogues collected from real conversations on social media. We manually annotate 100K dialogues from MMChat with the dialogue quality and whether the dialogues are related to the given image. We also provide the rule-filtered raw dialogues used to create MMChat (Rule Filtered Raw MMChat), which contain 4.257M dialogue sessions and 4.874M images, and a version of MMChat filtered based on LCCC (LCCC Filtered MMChat), which contains much cleaner dialogues (492.6K dialogue sessions and 1.066M images).
3 PAPERS • NO BENCHMARKS YET
WDC-Dialogue is a dataset built from Chinese social media to train EVA. Specifically, conversations from various sources are gathered, and a rigorous data cleaning pipeline is designed to ensure the quality of WDC-Dialogue.
JDDC 2.0 is a large-scale multimodal multi-turn dialogue dataset collected from JD.com, a mainstream Chinese e-commerce platform. It contains about 246 thousand dialogue sessions, 3 million utterances, and 507 thousand images, along with product knowledge bases and image category annotations. The dataset is divided into training, validation, and test sets in a ratio of 80%, 10%, and 10%.
1 PAPER • NO BENCHMARKS YET
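The 80%/10%/10% division described for JDDC 2.0 can be illustrated with a minimal sketch. This is not the dataset authors' procedure: it assumes a random split at the session level, and the function name `split_sessions`, the `seed` parameter, and the integer session IDs are all hypothetical.

```python
import random


def split_sessions(session_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle session IDs and cut them into train/validation/test lists.

    Assumption: the split is random and per session; the real JDDC 2.0
    split may have been produced differently.
    """
    ids = list(session_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle for reproducibility
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical usage with 1,000 placeholder session IDs.
train, val, test = split_sessions(range(1000))
print(len(train), len(val), len(test))
```

Splitting by session rather than by utterance keeps every turn of a conversation in the same subset, which avoids leaking dialogue context from training into evaluation.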