MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation

14 Dec 2021 · Chen Zhang, Luis Fernando D'Haro, Thomas Friedrichs, Haizhou Li

Chatbots are designed to carry out human-like conversations across different domains, such as general chit-chat, knowledge exchange, and persona-grounded conversations. To measure the quality of such conversational agents, a dialogue evaluator is expected to conduct assessment across domains as well. However, most state-of-the-art automatic dialogue evaluation metrics (ADMs) are not designed for multi-domain evaluation. We are motivated to design a general and robust framework, MDD-Eval, to address the problem. Specifically, we first train a teacher evaluator with human-annotated data so that it acquires the rating skill to tell good dialogue responses from bad ones in a particular domain. We then adopt a self-training strategy to train a new evaluator on teacher-annotated multi-domain data, which helps the new evaluator generalize across multiple domains. MDD-Eval is extensively assessed on six dialogue evaluation benchmarks. Empirical results show that the MDD-Eval framework achieves strong performance, with an absolute improvement of 7% over the state-of-the-art ADMs in terms of mean Spearman correlation scores across all the evaluation benchmarks.
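
The teacher-annotation step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' implementation: `teacher_score` is a toy word-overlap heuristic standing in for a trained teacher evaluator, and the confidence thresholds in `pseudo_label` are hypothetical (the abstract does not say whether or how teacher annotations are filtered).

```python
def teacher_score(context, response):
    """Toy stand-in for a trained teacher evaluator (hypothetical).

    Returns the fraction of response tokens that also occur in the context.
    """
    context_tokens = set(context.lower().split())
    response_tokens = response.lower().split()
    if not response_tokens:
        return 0.0
    overlap = sum(token in context_tokens for token in response_tokens)
    return overlap / len(response_tokens)


def pseudo_label(pairs, teacher, keep_above=0.7, keep_below=0.1):
    """Annotate unlabeled (context, response) pairs with the teacher.

    Assumed filtering rule: keep only confident predictions, labelling them
    1 (good response) or 0 (bad response), and drop everything in between.
    """
    labelled = []
    for context, response in pairs:
        score = teacher(context, response)
        if score >= keep_above:
            labelled.append((context, response, 1))
        elif score <= keep_below:
            labelled.append((context, response, 0))
    return labelled


# Made-up unlabeled pairs; in practice these would come from several domains.
unlabeled = [
    ("do you like jazz music", "i like jazz music"),
    ("do you like jazz music", "the train leaves at noon"),
    ("do you like jazz music", "you like trains"),
]

student_training_data = pseudo_label(unlabeled, teacher_score)
```

In this sketch the third pair is dropped because the toy teacher is not confident about it. A student evaluator would then be trained on `student_training_data`; that training step is omitted here.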

Code

e0397123/mdd-eval official

Tasks

Dialogue Evaluation

Datasets

ConceptNet

DailyDialog

PERSONA-CHAT

ConvAI2

Topical-Chat

USR-TopicalChat

Results from the Paper

Ranked #1 on Dialogue Evaluation on USR-TopicalChat

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Dialogue Evaluation	USR-TopicalChat	MDD-Eval	Spearman Correlation	0.5109	#1
Dialogue Evaluation	USR-TopicalChat	MDD-Eval	Pearson Correlation	0.4575	#2
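
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how Spearman and Pearson correlations between an automatic metric's scores and human ratings can be computed. The helper names and the example numbers are made up for illustration; they are not taken from the paper or its benchmark data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(ranks(x), ranks(y))


# Hypothetical metric scores and human ratings for five responses.
metric_scores = [0.91, 0.35, 0.62, 0.10, 0.77]
human_ratings = [5, 2, 4, 1, 3]

rho = spearman(metric_scores, human_ratings)
r = pearson(metric_scores, human_ratings)
```

Spearman correlation depends only on the ordering of the scores, so it rewards a metric that ranks responses the way humans do even when the relationship between the two scales is not linear.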
