We construct a dataset named CPED from 40 Chinese TV shows. CPED consists of multisource knowledge related to empathy and personal characteristics. This knowledge covers 13 emotions, gender, Big Five personality traits, 19 dialogue acts, and other knowledge.
15 PAPERS • 3 BENCHMARKS
DuLeMon is a large-scale Chinese Long-term Memory Conversation dataset, which simulates long-term memory conversations and focuses on the ability to actively construct and utilize the user's and the bot's persona in a long-term interaction. DuLeMon contains about 27.5k human-human conversations, 449k utterances, and 12k persona grounding sentences. This corpus can be used to explore Long-term Memory Conversation, Personalized Dialogue, and Persona Extraction / Matching / Retrieval.
12 PAPERS • NO BENCHMARKS YET
The Arabic-TOD dataset is based on the BiToD dataset. Of the 3,689 BiToD-English dialogues, 1,500 dialogues (30,000 utterances) were translated into Arabic. We translated the task-related keywords: cuisine, dietary restrictions, and price-level for the restaurant domain; price-level for the hotel domain; type and price-level for the attraction domain; and day, weather, and city for the weather domain. We kept the remaining values untranslated, such as hotel and restaurant names, locations, and addresses. These values are real entities in Hong Kong (literals), and most of them contain Chinese words written in English, so they have not been translated. For the slot-value pairs in the Arabic-TOD dataset, we kept the slot names in English and translated their corresponding values, except for the Hong Kong entities, since the Arabic-TOD dataset supports code-switching.
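The translation rule described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the slot identifiers (`price_level`, `dietary_restrictions`, and so on), the `translate_slots` helper, the toy lexicon, and the example restaurant name are all hypothetical and chosen only for illustration. Only the per-domain list of translated slots comes from the description.

```python
# Slots whose values are translated, per domain (from the dataset description).
# The snake_case slot identifiers are assumptions, not the dataset's actual keys.
TRANSLATED_SLOTS = {
    "restaurant": {"cuisine", "dietary_restrictions", "price_level"},
    "hotel": {"price_level"},
    "attraction": {"type", "price_level"},
    "weather": {"day", "weather", "city"},
}


def translate_slots(domain, slots, lexicon):
    """Return a copy of `slots` with only task-related values translated.

    Slot names stay in English. Values of slots outside TRANSLATED_SLOTS
    (names, locations, addresses) are copied unchanged, which yields
    code-switched output.
    """
    translatable = TRANSLATED_SLOTS.get(domain, set())
    result = {}
    for name, value in slots.items():
        if name in translatable:
            # Fall back to the original value if the lexicon has no entry.
            result[name] = lexicon.get(value, value)
        else:
            result[name] = value
    return result


# Toy English-to-Arabic lexicon (hypothetical, for illustration only).
lexicon = {"cheap": "رخيص", "italian": "إيطالي"}

example = translate_slots(
    "restaurant",
    {"cuisine": "italian", "price_level": "cheap", "name": "Example Restaurant"},
    lexicon,
)
```

In this sketch `example` keeps its English slot names, the cuisine and price-level values are replaced by their Arabic entries, and the restaurant name is left as-is.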
Dataset Overview
vanilla.csv: Interactions without specific role-play instructions.
boss.csv: Interactions in which ChatGPT plays the role of the user's boss.
classmate.csv: Interactions in which ChatGPT acts as the user's classmate.
Each turn was coded with the user's motive (for user responses) or the perceived naturalness (for ChatGPT responses).