The Ubuntu Dialogue Corpus (UDC) is a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. It provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset combines the multi-turn property of conversations in the Dialog State Tracking Challenge datasets with the unstructured nature of interactions from microblog services such as Twitter.
44 PAPERS • 8 BENCHMARKS
Most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or passively agreeing with them.
13 PAPERS • 1 BENCHMARK
SODA is a high-quality social dialogue dataset. In contrast to most existing crowdsourced, small-scale dialogue corpora, SODA distills 1.5M socially-grounded dialogues from a pre-trained language model (InstructGPT; Ouyang et al.). Dialogues are distilled by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x).
11 PAPERS • NO BENCHMARKS YET