Dialogue Evaluation
39 papers with code • 2 benchmarks • 2 datasets
Most implemented papers
Adversarial Learning for Neural Dialogue Generation
In this paper, drawing intuition from the Turing test, we propose using adversarial training for open-domain dialogue generation: the system is trained to produce sequences that are indistinguishable from human-generated dialogue utterances.
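A rough sketch of the core idea, assuming a PyTorch setup: the discriminator's judgment of a sampled reply is used as a policy-gradient reward for the generator. Names and shapes are illustrative, not the paper's implementation.

```python
import torch

def generator_loss(log_probs, human_prob):
    """log_probs: summed log-probabilities of the sampled reply tokens (with grad).
    human_prob: discriminator's estimate of P(human-written | context, reply)."""
    reward = human_prob.detach()          # reward comes from the discriminator; no gradient through it
    return -(log_probs * reward).mean()   # REINFORCE-style: raise reply probability when the reward is high

loss = generator_loss(torch.tensor([-3.2], requires_grad=True), torch.tensor([0.8]))
```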
Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems
To investigate the strengths of this novel metric and interactive evaluation in comparison to state-of-the-art metrics and human evaluation of static conversations, we perform extended experiments with a set of models, including several that make novel improvements to recent hierarchical dialog generation architectures through sentiment and semantic knowledge distillation on the utterance level.
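An illustrative self-play loop under assumed helpers (`generate_reply` and `score_conversation` are hypothetical stand-ins for a dialogue model and a metric suite): the same model produces both sides of a conversation, which is then scored automatically.

```python
def self_play(generate_reply, score_conversation, seed_utterance, num_turns=6):
    history = [seed_utterance]
    for _ in range(num_turns):
        history.append(generate_reply(history))  # the model responds to its own previous turns
    return score_conversation(history)           # e.g. sentiment, coherence, or engagement metrics
```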
Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References
The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation.
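A minimal sketch of multi-reference scoring using NLTK's BLEU, where a response is credited if it matches any of several human-written references rather than a single gold reply; the example sentences are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [r.split() for r in ["i am doing well thanks", "pretty good , how about you ?"]]
hypothesis = "i am pretty good thanks".split()
score = sentence_bleu(references, hypothesis, smoothing_function=SmoothingFunction().method1)
```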
Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems
In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, predictive engagement, for automatic evaluation of open-domain dialogue systems.
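A hedged sketch of the aggregation idea only: utterance-level engagement scores from some trained classifier (the `engagement_model` callable is hypothetical) are averaged into a conversation-level estimate.

```python
def conversation_engagement(utterances, engagement_model):
    scores = [engagement_model(u) for u in utterances]  # each utterance-level score assumed in [0, 1]
    return sum(scores) / len(scores)                    # conversation-level engagement estimate
```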
Automatic Evaluation and Moderation of Open-domain Dialogue Systems
The development of Open-Domain Dialogue Systems (ODS) is a trending topic due to the large number of research challenges, large societal and business impact, and advances in the underlying technology.
RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems
Open-domain human-computer conversation has been attracting increasing attention over the past few years.
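A sketch of a RUBER-style "referenced" score, assuming a word-embedding lookup (`embed` is hypothetical): cosine similarity between pooled embeddings of the generated reply and the ground-truth reply. The full method also learns an "unreferenced" score relating the reply to the query.

```python
import numpy as np

def referenced_score(generated, groundtruth, embed):
    def pool(tokens):
        return np.max([embed(t) for t in tokens], axis=0)  # max-pool token vectors (one pooling choice)
    g, r = pool(generated.split()), pool(groundtruth.split())
    return float(np.dot(g, r) / (np.linalg.norm(g) * np.linalg.norm(r)))
```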
Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem.
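A rough sketch in the spirit of learned response evaluators: the score depends on the dialogue context, a reference reply, and the model reply through learned bilinear terms. All symbols are illustrative placeholders, not the paper's exact parameterization.

```python
import numpy as np

def learned_score(c, r, r_hat, M, N, alpha, beta):
    # c: context embedding, r: reference-reply embedding, r_hat: model-reply embedding
    # M, N, alpha, beta: learned parameters (placeholders here)
    return (c @ M @ r_hat + r @ N @ r_hat - alpha) / beta
```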
Evaluating Coherence in Dialogue Systems using Entailment
Evaluating open-domain dialogue systems is difficult due to the diversity of possible correct answers.
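A hedged sketch of entailment-based coherence scoring using an off-the-shelf NLI model from HuggingFace (roberta-large-mnli is one public choice; the paper itself trains entailment models on dialogue data).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailment_score(context, response):
    inputs = tok(context, response, return_tensors="pt", truncation=True)
    probs = torch.softmax(nli(**inputs).logits, dim=-1)[0]
    ent_idx = [i for i, lbl in nli.config.id2label.items() if "ENTAIL" in lbl.upper()][0]
    return probs[ent_idx].item()  # probability that the response is entailed by the context
```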
Towards Best Experiment Design for Evaluating Dialogue System Output
To overcome the limitations of automated metrics (e.g., BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence.
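A minimal sketch of the usual validation step, assuming paired scores are available: correlate an automatic metric's scores with human judgments (the numbers below are made-up examples).

```python
from scipy.stats import spearmanr

metric_scores = [0.12, 0.40, 0.33, 0.75, 0.51]  # automatic metric, one score per response
human_scores = [1, 3, 2, 5, 4]                   # human ratings of the same responses
rho, p_value = spearmanr(metric_scores, human_scores)
```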
PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems
Through extensive experiments, learning-based metrics are demonstrated to be the most effective evaluation metrics for open-domain generative dialogue systems.