Dialogue Evaluation
48 papers with code • 2 benchmarks • 6 datasets
Latest papers
A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators
Yet, existing work on using LLMs for automatic dialogue evaluation is limited in scope with respect to the number of meta-evaluation datasets, the modes of evaluation, and the coverage of LLMs.
xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark
The English dialogue data are extended to nine other languages with commercial machine translation systems.
Towards Multilingual Automatic Dialogue Evaluation
The main limiting factor in the development of robust multilingual dialogue evaluation metrics is the lack of multilingual data and the limited availability of open-source multilingual dialogue systems.
Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation
Despite significant research effort in the development of automatic dialogue evaluation metrics, little attention has been paid to evaluating dialogues in languages other than English.
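As a rough illustration of what such prompting-based metrics look like, here is a minimal sketch that asks an LLM to rate a response on a 1-5 scale via the OpenAI chat API; the prompt wording, model name, and score parsing are illustrative assumptions, not the setup from the paper.

```python
# Hypothetical sketch: prompt an LLM to rate a dialogue response on a 1-5 scale.
# The prompt template and parsing are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Rate how coherent and engaging the RESPONSE is given the dialogue CONTEXT.\n"
    "Answer with a single integer from 1 (poor) to 5 (excellent).\n\n"
    "CONTEXT:\n{context}\n\nRESPONSE:\n{response}\n\nScore:"
)

def llm_score(context: str, response: str, model: str = "gpt-4o-mini") -> int:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(context=context, response=response)}],
        temperature=0,
    )
    # Keep the first digit in the reply; real metrics often average several samples.
    reply = completion.choices[0].message.content
    return int(next(ch for ch in reply if ch.isdigit()))

print(llm_score("A: Any plans for the weekend?\nB: Not yet, you?",
                "I might go hiking if the weather holds."))
```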
C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation
Existing reference-free turn-level evaluation metrics for chatbots inadequately capture the interaction between the user and the system.
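The title suggests a pointwise-mutual-information-style score; a common way to operationalize conditional PMI is to measure how much the user's turn raises the likelihood of the system response under a language model, CPMI(r; u | c) = log p(r | c, u) - log p(r | c). The sketch below implements that formulation with GPT-2 from Hugging Face transformers as the scorer; both the formulation and the choice of model are assumptions, not the paper's exact estimator.

```python
# Hypothetical conditional-PMI-style score:
#   CPMI(response; user_turn | context) = log p(response | context, user_turn)
#                                         - log p(response | context)
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(prefix: str, continuation: str) -> float:
    """Sum of log p(token | preceding tokens) over the continuation tokens."""
    # Tokenizing prefix and continuation separately is a simplification.
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # position i predicts token i+1
    positions = torch.arange(prefix_ids.size(1) - 1, ids.size(1) - 1)
    return log_probs[0, positions].gather(1, cont_ids.T).sum().item()

def cpmi(context: str, user_turn: str, response: str) -> float:
    # How much does the user's turn raise the response's likelihood?
    return (log_prob(context + "\n" + user_turn + "\n", response)
            - log_prob(context + "\n", response))
```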
DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation
Despite the recent advances in open-domain dialogue systems, building a reliable evaluation metric is still a challenging problem.
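Reading "density estimation" at face value: one natural recipe is to fit a density model to embeddings of human-written responses and score a candidate by its log-likelihood under that model. The sketch below does this with sentence-transformers embeddings and a scikit-learn kernel density estimator; both components are stand-in assumptions, and the paper's feature space and estimator may differ.

```python
# Hypothetical sketch: score responses by their density under a model fitted to
# human-written responses. Embeddings + KDE are illustrative stand-ins only.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import KernelDensity

encoder = SentenceTransformer("all-MiniLM-L6-v2")

human_responses = [
    "I might go hiking if the weather holds.",
    "Not much, just catching up on sleep.",
    "That sounds great, count me in!",
]
kde = KernelDensity(bandwidth=0.5).fit(encoder.encode(human_responses))

def density_score(response: str) -> float:
    """Higher = more 'human-like' under the fitted density (a rough proxy)."""
    return kde.score_samples(encoder.encode([response]))[0]

print(density_score("I might go for a walk tomorrow."))   # closer to the data
print(density_score("banana banana banana banana"))       # far from the data
```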
GLM-Dialog: Noise-tolerant Pre-training for Knowledge-grounded Dialogue Generation
We present GLM-Dialog, a large-scale language model (LLM) with 10B parameters capable of knowledge-grounded conversation in Chinese, using a search engine to access knowledge from the Internet.
Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and is compared with existing approaches.
FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
Recent model-based reference-free metrics for open-domain dialogue evaluation exhibit promising correlations with human judgment.
SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation
This paper introduces a novel Self-supervised Fine-grained Dialogue Evaluation framework (SelF-Eval).