The Reddit dataset is a graph dataset built from Reddit posts made in September 2014. The node label is the community, or “subreddit”, that a post belongs to. 50 large communities were sampled to build a post-to-post graph, connecting two posts if the same user comments on both. In total the dataset contains 232,965 posts with an average degree of 492. Posts from the first 20 days are used for training and the remaining days for testing (with 30% of those held out for validation). For node features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
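The edge rule above (connect two posts whenever the same user comments on both) can be sketched in a few lines. This is a minimal illustration over a toy comment log, not the dataset's official construction code; the variable names and sample data are made up.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical comment log: (user, post) pairs.
comments = [
    ("alice", "p1"), ("alice", "p2"),
    ("bob", "p2"), ("bob", "p3"),
    ("carol", "p1"),
]

# Group posts by commenting user, then link every pair of posts
# that share a commenter -- the post-to-post edge rule.
posts_by_user = defaultdict(set)
for user, post in comments:
    posts_by_user[user].add(post)

edges = set()
for posts in posts_by_user.values():
    for a, b in combinations(sorted(posts), 2):
        edges.add((a, b))

print(sorted(edges))  # [('p1', 'p2'), ('p2', 'p3')]
```

Note that a user who comments on k posts contributes all k-choose-2 pairs, which is why a handful of very active users drives the high average degree (492) reported above.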
176 PAPERS • 4 BENCHMARKS
OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1,689 bitexts spanning 2.6 billion sentences across 60 languages.
98 PAPERS • 1 BENCHMARK
The PERSONA-CHAT dataset contains multi-turn dialogues conditioned on personas. The dataset consists of 8939 complete dialogues for training, 1000 for validation, and 968 for testing. Each dialogue was performed between two crowdsourced workers assuming artificial personas (described by 3 to 5 profile sentences, such as “I like to ski”, “I am an artist”, “I eat sardines for breakfast daily”). There are 955 possible personas for training, 100 for validation, and 100 for testing. Additionally, revised persona descriptions are provided, produced by rephrasing, generalizing, or specializing the original ones.
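The structure of one persona-conditioned dialogue can be pictured as follows. The field names here are illustrative, not the official release schema; the turns are invented, while the persona sentences are the examples quoted above.

```python
# A minimal sketch of one PERSONA-CHAT-style training example.
# Field names ("persona", "dialogue") are assumptions for illustration.
example = {
    "persona": [
        "I like to ski.",
        "I am an artist.",
        "I eat sardines for breakfast daily.",
    ],
    "dialogue": [
        ("speaker_1", "Hi! Do you have any hobbies?"),
        ("speaker_2", "I love to ski, and I paint in my spare time."),
    ],
}

# Each persona has 3 to 5 profile sentences that condition
# the responses of the speaker who holds it.
assert 3 <= len(example["persona"]) <= 5
```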
96 PAPERS • 1 BENCHMARK
Ubuntu Dialogue Corpus (UDC) is a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets, and the unstructured nature of interactions from microblog services such as Twitter.
34 PAPERS • 7 BENCHMARKS
This is a document-grounded dataset for text conversations. “Document Grounded Conversations” are conversations about the contents of a specified document; in this dataset the specified documents are Wikipedia articles about popular movies. The dataset contains 4112 conversations with an average of 21.43 turns per conversation.
10 PAPERS • NO BENCHMARKS YET
OpenDialKG contains utterances from 15K human-to-human role-playing dialogs, each manually annotated with ground-truth references to the corresponding entities and paths from a large-scale KG with 1M+ facts.
9 PAPERS • NO BENCHMARKS YET
PersonalDialog is a large-scale multi-turn dialogue dataset containing various traits from a large number of speakers. The dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers. Each utterance is associated with a speaker who is marked with traits like Age, Gender, Location, Interest Tags, etc. Several anonymization schemes are designed to protect the privacy of each speaker.
4 PAPERS • NO BENCHMARKS YET
Collected by leveraging background knowledge from a larger, more highly represented dialogue source.
2 PAPERS • NO BENCHMARKS YET
Contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues).
1 PAPER • NO BENCHMARKS YET
OpenViDial is a large-scale open-domain dialogue dataset with visual contexts. The dialogue turns and visual contexts are extracted from movies and TV series, where each dialogue turn is paired with the visual context in which it takes place. OpenViDial contains a total of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored as images.
1 PAPER • NO BENCHMARKS YET