Dialogue Evaluation

48 papers with code • 2 benchmarks • 6 datasets

Dialogue evaluation is the task of assessing the quality of responses or whole conversations produced by dialogue systems, whether with automatic metrics, trained evaluators, or human judgments.

Latest papers with no code

Structured Information Matters: Incorporating Abstract Meaning Representation into LLMs for Improved Open-Domain Dialogue Evaluation

no code yet • 1 Apr 2024

Trainable evaluation metrics are commonly trained on true positive and randomly selected negative responses, which biases them toward assigning higher scores to responses that share more content similarity with the given context.
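
The snippet describes the usual training recipe for learned metrics rather than the paper's AMR-based method. The sketch below is a minimal, hypothetical illustration of that recipe: a scorer trained to rank true (context, response) pairs above randomly mismatched ones; the model size, tokenization, and margin loss are all assumptions.

```python
# Minimal, hypothetical sketch (not the paper's AMR-based method): a trainable
# metric trained with true positives vs. randomly mismatched negative responses.
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Scores a (context, response) pair from mean-pooled token embeddings."""
    def __init__(self, vocab_size=5000, dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)          # mean-pools token embeddings
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, ctx_ids, resp_ids):
        feats = torch.cat([self.emb(ctx_ids), self.emb(resp_ids)], dim=-1)
        return self.mlp(feats).squeeze(-1)                   # one score per pair

def train_step(model, optim, contexts, responses):
    """One step of positive-vs-random-negative training."""
    negatives = responses[torch.randperm(len(responses))]    # random negatives from the batch
    pos, neg = model(contexts, responses), model(contexts, negatives)
    # Margin ranking loss: push true pairs above mismatched ones.
    loss = nn.functional.margin_ranking_loss(pos, neg, torch.ones_like(pos), margin=1.0)
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()

model = PairScorer()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
ctx = torch.randint(0, 5000, (8, 20))                        # 8 tokenized contexts (toy data)
resp = torch.randint(0, 5000, (8, 12))                       # 8 tokenized responses
train_step(model, optim, ctx, resp)
```

Because the negatives are just shuffled responses, a metric trained this way can learn to reward surface similarity to the context, which is the tendency the paper targets.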

PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison

no code yet • 1 Apr 2024

Recent studies have proposed evaluation metrics that assess generated responses by considering their relevance to the preceding dialogue history.
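
As a rough illustration of pairwise-comparison evaluation (not PairEval's actual prompting, model, or aggregation), the sketch below ranks candidate responses by how many head-to-head comparisons they win; `judge_fn` is a hypothetical callable, e.g. a wrapper around an LLM, that returns "A" or "B".

```python
# Illustrative sketch of pairwise-comparison evaluation; `judge_fn` is a
# hypothetical callable (e.g. an LLM wrapper) that returns "A" or "B".
from itertools import combinations
from collections import Counter

def pairwise_rank(history: str, candidates: dict[str, str], judge_fn):
    """Rank candidate responses by how many head-to-head comparisons they win."""
    wins = Counter({name: 0 for name in candidates})
    for (name_a, resp_a), (name_b, resp_b) in combinations(candidates.items(), 2):
        prompt = (
            f"Dialogue history:\n{history}\n\n"
            f"Response A: {resp_a}\nResponse B: {resp_b}\n"
            "Which response fits the history better? Answer A or B."
        )
        winner = name_a if judge_fn(prompt).strip() == "A" else name_b
        wins[winner] += 1
    return wins.most_common()

# Stub judge for illustration only; a real judge would call a language model.
ranking = pairwise_rank(
    "A: Any plans tonight?\nB: Not yet, why?",
    {"sys1": "I was thinking we could catch a movie.", "sys2": "Bananas are yellow."},
    judge_fn=lambda prompt: "A",
)
```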

A Three-Phases SFT Hybrid Model Integrated Strong Prior Module and Data Overlap Estimation in the Education Context

no code yet • 13 Mar 2024

More specifically, our model performs structured decomposition and incrementally guided output of educational knowledge.

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

no code yet • 3 Nov 2023

In this paper, we propose DialogBench, a dialogue evaluation benchmark containing 12 dialogue tasks that probe the capabilities LLMs should have as human-like dialogue systems.

RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue

no code yet • 15 Sep 2023

To this end, we propose the Reference-Assisted Dialogue Evaluation (RADE) approach under a multi-task learning framework, which leverages a pre-created utterance as a reference, rather than only the gold response, to relieve the one-to-many problem.
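
A minimal sketch of the general idea the snippet names, not RADE's actual architecture: a shared encoder feeds a rating-regression head and an auxiliary head, and the reference utterance is an extra input alongside the context and response. The dimensions and the choice of auxiliary task are assumptions.

```python
# Hypothetical sketch of a reference-assisted, multi-task scorer (not RADE's
# architecture): one head regresses a quality rating, an auxiliary head
# predicts response-reference agreement, and both share the encoder.
import torch
import torch.nn as nn

class ReferenceAssistedScorer(nn.Module):
    def __init__(self, vocab_size=5000, dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)   # shared encoder (mean-pooled embeddings)
        self.score_head = nn.Linear(3 * dim, 1)       # main task: quality rating
        self.match_head = nn.Linear(2 * dim, 1)       # auxiliary task: response vs. reference

    def forward(self, ctx_ids, resp_ids, ref_ids):
        c, r, f = self.emb(ctx_ids), self.emb(resp_ids), self.emb(ref_ids)
        score = self.score_head(torch.cat([c, r, f], dim=-1)).squeeze(-1)
        match_logit = self.match_head(torch.cat([r, f], dim=-1)).squeeze(-1)
        return score, match_logit

def multitask_loss(score, rating, match_logit, match_label, alpha=0.5):
    """Weighted sum of rating regression and the auxiliary matching objective."""
    return (nn.functional.mse_loss(score, rating)
            + alpha * nn.functional.binary_cross_entropy_with_logits(match_logit, match_label))
```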

Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation

no code yet • 14 Sep 2023

Human evaluation has been widely accepted as the standard for evaluating chat-oriented dialogue systems.

How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation

no code yet • 23 May 2023

We release MMSMR, a Massively Multi-System MultiReference dataset to enable future work on metrics and evaluation for dialog.
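
To make concrete what a multi-reference, multi-system dataset enables, here is a hedged sketch (the field names `response`, `references`, and `human_rating` are hypothetical, not MMSMR's schema): score each response against all references with a standard overlap metric and check how well the metric correlates with human ratings.

```python
# Hedged sketch of multi-reference metric evaluation; the fields 'response',
# 'references', and 'human_rating' are assumed, not MMSMR's actual schema.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

def multiref_bleu(response: str, references: list[str]) -> float:
    """Sentence BLEU computed against all references at once."""
    refs = [r.split() for r in references]
    return sentence_bleu(refs, response.split(),
                         smoothing_function=SmoothingFunction().method1)

def metric_human_correlation(examples) -> float:
    """Spearman correlation between the metric and human ratings over a test set."""
    metric_scores = [multiref_bleu(ex["response"], ex["references"]) for ex in examples]
    human_scores = [ex["human_rating"] for ex in examples]
    corr, _ = spearmanr(metric_scores, human_scores)
    return corr
```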

U-NEED: A Fine-grained Dataset for User Needs-Centric E-commerce Conversational Recommendation

no code yet • 5 May 2023

In this paper, we construct a user needs-centric E-commerce conversational recommendation dataset (U-NEED) from real-world E-commerce scenarios.

Pragmatically Appropriate Diversity for Dialogue Evaluation

no code yet • 6 Apr 2023

To remedy this, we propose the notion of Pragmatically Appropriate Diversity, defined as the extent to which a conversation creates and constrains the creation of multiple diverse responses.
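
The paper defines Pragmatically Appropriate Diversity conceptually; the sketch below is not that measure, just one common way to quantify how diverse a set of candidate responses to the same conversation is (distinct-n over the pooled responses).

```python
# Illustration only (not the paper's measure): distinct-n over a pool of
# candidate responses to the same conversation as a crude diversity score.
def distinct_n(responses: list[str], n: int = 2) -> float:
    """Fraction of n-grams in the pooled responses that are unique (1.0 = none repeated)."""
    ngrams = []
    for resp in responses:
        tokens = resp.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# A conversation that invites many plausible continuations should yield a diverse set.
distinct_n(["I love hiking in the mountains.",
            "Honestly, I would rather stay home with a book.",
            "Travel sounds exhausting to me."])
```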

Improving Open-Domain Dialogue Evaluation with a Causal Inference Model

no code yet • 31 Jan 2023

We project these features to the dialogue level and train a dialogue-level MLP regression model, a dialogue-level LSTM, and a novel causal inference model called counterfactual-LSTM (CF-LSTM) to predict ratings.
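
A minimal sketch of the dialogue-level LSTM regressor mentioned in the snippet, with the turn-level features, their dimensionality, and the counterfactual (CF-LSTM) component left as assumptions: per-turn feature vectors are encoded and the final hidden state is regressed to a rating.

```python
# Minimal sketch of a dialogue-level LSTM regressor; feature extraction and the
# counterfactual (CF-LSTM) component are omitted, and all dimensions are assumed.
import torch
import torch.nn as nn

class DialogueLSTMRegressor(nn.Module):
    def __init__(self, feat_dim=16, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, turn_features):               # (batch, num_turns, feat_dim)
        _, (h_n, _) = self.lstm(turn_features)      # final hidden state summarizes the dialogue
        return self.head(h_n[-1]).squeeze(-1)       # predicted dialogue-level rating

model = DialogueLSTMRegressor()
pred = model(torch.randn(4, 10, 16))                # 4 dialogues, 10 turns, 16 features per turn
loss = nn.functional.mse_loss(pred, torch.rand(4) * 5)   # regress toward e.g. 0-5 human ratings
```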