TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Dialogue Evaluation	USR-PersonaChat	Lin-Reg (all)	Spearman Correlation	0.5382	# 1
Dialogue Evaluation	USR-PersonaChat	Lin-Reg (all)	Pearson Correlation	0.5290	# 2
Dialogue Evaluation	USR-TopicalChat	Lin-Reg (all)	Spearman Correlation	0.4877	# 2
Dialogue Evaluation	USR-TopicalChat	Lin-Reg (all)	Pearson Correlation	0.4974	# 1


Proxy Indicators for the Quality of Open-domain Dialogues

EMNLP 2021 · Rostislav Nedelchev, Jens Lehmann, Ricardo Usbeck

The automatic evaluation of open-domain dialogues remains a largely unsolved challenge. Despite the abundance of work in the field, human judges are still needed to evaluate dialogues’ quality, so performing such evaluations at scale is usually expensive. This work investigates using a deep-learning model trained on the General Language Understanding Evaluation (GLUE) benchmark as an indicator of the quality of open-domain dialogues. The aim is to use the various GLUE tasks as different perspectives for judging the quality of a conversation, thus reducing the need for additional training data or for responses that serve as quality references. Because of this design, the method can infer various quality metrics and derive a component-based overall score. We achieve statistically significant correlation coefficients of up to 0.7.
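The results on this page are reported as Pearson and Spearman correlations between automatic scores and human judgments. As a rough illustration of how such coefficients are computed, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names and the toy data are hypothetical and were made up for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical toy data: one automatic proxy score and one human
    # rating per dialogue response. These numbers are illustrative only.
    proxy_scores = [0.10, 0.35, 0.40, 0.80, 0.95]
    human_ratings = [1, 2, 2, 4, 5]
    print("Pearson:", pearson(proxy_scores, human_ratings))
    print("Spearman:", spearman(proxy_scores, human_ratings))
```

Pearson measures linear agreement with the human ratings, while Spearman only measures agreement in ranking, which is why the two coefficients can differ for the same model and dataset.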


Code


smartdataanalytics/proxy_indicators official

Tasks


Dialogue Evaluation

Datasets

GLUE

SST

MultiNLI

SST-2

QNLI

MRPC

CoLA

WSC

PERSONA-CHAT

Topical-Chat

USR-TopicalChat

USR-PersonaChat

Results from the Paper


Ranked #1 on Dialogue Evaluation on USR-PersonaChat


