Assessing Dialogue Systems with Distribution Distances

Findings (ACL) 2021 · Jiannan Xiang, Yahui Liu, Deng Cai, Huayang Li, Defu Lian, Lemao Liu

An important aspect of developing dialogue systems is how to evaluate and compare the performance of different systems. Existing automatic evaluation metrics are based on turn-level quality evaluation and use average scores for system-level comparison. In this paper, we propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations. Specifically, two distribution-wise metrics, FBD and PRD, are developed and evaluated. Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
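
To make the idea of a distribution-wise distance concrete, here is a minimal sketch of a Fréchet-style distance between two sets of conversation embeddings. It rests on assumptions not stated in the abstract: that each conversation has already been encoded as a fixed-size vector by some encoder, and that each set is summarized by a Gaussian (mean and covariance), as in the Fréchet distance familiar from image generation. The function name `frechet_distance` and the random toy data are hypothetical. This is not the authors' implementation of FBD or PRD.

```python
import numpy as np


def frechet_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Both inputs have shape (num_conversations, dim) with dim > 1.
    The encoder that produces the embeddings is assumed, not specified here.
    """
    mu_r = real_emb.mean(axis=0)
    mu_g = gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    # tr(sqrt(cov_r @ cov_g)) equals the sum of square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # positive semi-definite inputs (up to floating-point noise).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Toy usage with random vectors standing in for conversation embeddings.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
generated = real + 2.0  # same spread, shifted mean

same_score = frechet_distance(real, real)
shifted_score = frechet_distance(real, generated)
```

A lower score means the generated set lies closer to the real one under this Gaussian summary. Because the score is computed over whole sets rather than averaged over turns, it yields one number per system, which is the kind of system-level comparison the abstract describes.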