Dialogue Evaluation
48 papers with code • 2 benchmarks • 6 datasets
Latest papers with no code
Structured Information Matters: Incorporating Abstract Meaning Representation into LLMs for Improved Open-Domain Dialogue Evaluation
Trainable evaluation metrics are commonly trained with true positive and randomly selected negative responses, resulting in a tendency to assign higher scores to responses that have greater content similarity with a given context.
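The contrastive setup described above is easy to picture in code. Below is a minimal, hypothetical sketch (not taken from the paper) of how such training pairs are commonly constructed: each context is paired with its gold response as a positive and with a response drawn from a different dialogue as a random negative.

```python
import random

def build_metric_training_pairs(dialogues, seed=0):
    """Build (context, response, label) examples for a trainable dialogue
    evaluation metric: the gold response is a positive, and a response
    randomly drawn from another dialogue serves as the negative.
    `dialogues` is a list of (context, gold_response) tuples; this pairing
    scheme is the common practice the paper critiques, not its proposal."""
    rng = random.Random(seed)
    examples = []
    for i, (context, gold) in enumerate(dialogues):
        examples.append((context, gold, 1))             # true positive
        j = rng.choice([k for k in range(len(dialogues)) if k != i])
        examples.append((context, dialogues[j][1], 0))  # random negative
    return examples

pairs = build_metric_training_pairs([
    ("A: Any plans tonight?", "B: Thinking about a movie, want to join?"),
    ("A: How was the exam?", "B: Harder than I expected, honestly."),
])
print(pairs)
```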
PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison
Recent studies proposed evaluation metrics that assess generated responses by considering their relevance to previous dialogue histories.
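As a toy illustration of the relevance-based scoring paradigm described above (not PairEval's pairwise method, and using a bag-of-words stand-in for a learned encoder), a minimal scorer might look like the following; all names and the similarity measure are illustrative assumptions.

```python
from collections import Counter
import math

def relevance_score(context: str, response: str) -> float:
    """Toy context-relevance metric: cosine similarity between bag-of-words
    vectors of the dialogue history and the candidate response. Learned
    metrics use neural encoders, but the scoring interface is the same:
    higher means the response overlaps more with the context."""
    c, r = Counter(context.lower().split()), Counter(response.lower().split())
    dot = sum(c[w] * r[w] for w in c)
    norm = (math.sqrt(sum(v * v for v in c.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0

history = "A: I just adopted a puppy. B: That's great! What breed is it?"
print(relevance_score(history, "A: She's a beagle, very energetic."))
print(relevance_score(history, "A: The stock market dropped today."))
```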
A Three-Phases SFT Hybrid Model Integrated Strong Prior Module and Data Overlap Estimation in the Education Context
More specifically, our model realizes structural decomposition of educational knowledge and delivers it through incremental, guided output.
DialogBench: Evaluating LLMs as Human-like Dialogue Systems
In this paper, we propose DialogBench, a dialogue evaluation benchmark that contains 12 dialogue tasks to probe the capabilities that LLMs should have as human-like dialogue systems.
RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue
To this end, we propose the Reference-Assisted Dialogue Evaluation (RADE) approach under the multi-task learning framework, which leverages a pre-created utterance other than the gold response as the reference to relieve the one-to-many problem.
Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation
Human evaluation has been widely accepted as the standard for evaluating chat-oriented dialogue systems.
How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation
We release MMSMR, a Massively Multi-System MultiReference dataset to enable future work on metrics and evaluation for dialog.
U-NEED: A Fine-grained Dataset for User Needs-Centric E-commerce Conversational Recommendation
In this paper, we construct a user needs-centric E-commerce conversational recommendation dataset (U-NEED) from real-world E-commerce scenarios.
Pragmatically Appropriate Diversity for Dialogue Evaluation
To remedy this, we propose the notion of Pragmatically Appropriate Diversity, defined as the extent to which a conversation creates and constrains the creation of multiple diverse responses.
Improving Open-Domain Dialogue Evaluation with a Causal Inference Model
We project these features to the dialogue level and train a dialogue-level MLP regression model, a dialogue-level LSTM, and a novel causal inference model called counterfactual-LSTM (CF-LSTM) to predict ratings.
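As a rough illustration of the dialogue-level LSTM regressor mentioned here (the baseline setup, not the paper's counterfactual CF-LSTM), a minimal PyTorch sketch might look like this; the feature dimension, hidden size, and class name are assumptions.

```python
import torch
import torch.nn as nn

class DialogueLSTMRegressor(nn.Module):
    """Dialogue-level rating regressor: an LSTM reads a sequence of per-turn
    feature vectors and a linear head maps the final hidden state to a single
    score. Feature and hidden dimensions are illustrative assumptions."""
    def __init__(self, feature_dim: int = 16, hidden_dim: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, turn_features: torch.Tensor) -> torch.Tensor:
        # turn_features: (batch, num_turns, feature_dim)
        _, (h_n, _) = self.lstm(turn_features)
        return self.head(h_n[-1]).squeeze(-1)  # one rating per dialogue

model = DialogueLSTMRegressor()
fake_batch = torch.randn(4, 10, 16)  # 4 dialogues, 10 turns, 16 turn-level features
print(model(fake_batch).shape)       # torch.Size([4])
```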