Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus
In this paper, we analyze neural network-based dialogue systems trained in an end-to-end manner using an updated version of the recent Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This dataset is interesting because of its size, long context lengths, and technical nature; thus, it can be used to train large models directly from data with minimal feature engineering. We provide baselines in two different environments: one where models are trained to select the correct next response from a list of candidate responses, and one where models are trained to maximize the log-likelihood of a generated utterance conditioned on the context of the conversation. These are both evaluated on a recall task that we call next utterance classification (NUC), and using vector-based metrics that capture the topicality of the responses. We observe that current end-to-end models are unable to completely solve these tasks; thus, we provide a qualitative error analysis to determine the primary causes of error for end-to-end models evaluated on NUC, and examine sample utterances from the generative models. As a result of this analysis, we suggest some promising directions for future research on the Ubuntu Dialogue Corpus, which can also be applied to end-to-end dialogue systems in general.
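The NUC evaluation described above is typically scored as Recall@k: the model ranks a list of candidate responses (the true next utterance plus distractors), and the prediction counts as correct if the true response lands in the top k. A minimal sketch of that metric, assuming candidate scores have already been produced by some model (the scores below are illustrative placeholders, not results from the paper):

```python
# Sketch of Recall@k for next utterance classification (NUC).
# Given per-candidate scores and the index of the true response,
# check whether the true response ranks among the top k candidates.

def recall_at_k(scores, true_index, k):
    """Return 1 if the true response is among the k highest-scoring candidates."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1 if true_index in ranked[:k] else 0

# Hypothetical example: 10 candidates, true response (index 0) scored second-highest.
scores = [0.8, 0.9, 0.1, 0.3, 0.2, 0.05, 0.4, 0.15, 0.25, 0.35]
print(recall_at_k(scores, true_index=0, k=1))  # 0: not the single top candidate
print(recall_at_k(scores, true_index=0, k=2))  # 1: within the top 2
```

Averaging this indicator over a test set of (context, candidate list) pairs gives the Recall@k figures commonly reported for this corpus.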
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Conversation Disentanglement | Linux IRC (Ch2 Elsner) | Heuristic | 1-1 | 45.1 | # 3
Conversation Disentanglement | Linux IRC (Ch2 Elsner) | Heuristic | Local | 73.8 | # 3
Conversation Disentanglement | Linux IRC (Ch2 Elsner) | Heuristic | Shen F-1 | 51.8 | # 3
Conversation Disentanglement | Linux IRC (Ch2 Kummerfeld) | Heuristic | 1-1 | 43.4 | # 3
Conversation Disentanglement | Linux IRC (Ch2 Kummerfeld) | Heuristic | Local | 67.9 | # 2
Conversation Disentanglement | Linux IRC (Ch2 Kummerfeld) | Heuristic | Shen F-1 | 50.7 | # 2