DialSummEval is a multi-faceted dataset of human judgments, created to revisit the evaluation of dialogue summarization models. It contains the outputs of 14 models on SAMSum, a dialogue summarization dataset.
The creators of DialSummEval observed that current dialogue summarization models have flaws that frequently used metrics such as ROUGE may not expose well. They therefore re-evaluated 18 categories of metrics along four dimensions: coherence, consistency, fluency, and relevance. They also conducted the first unified human evaluation of a broad range of dialogue summarization models.
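Re-evaluating a metric against human judgments typically means correlating the metric's scores with the human ratings for each dimension. As a minimal sketch of that idea (the scores below are made up for illustration, and the pure-Python Spearman implementation is a stand-in for whatever correlation routine the authors actually used):

```python
def rankdata(values):
    # Assign 1-based ranks, averaging ranks over ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman's rho = Pearson correlation computed on the ranks.
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical example: automatic metric scores vs. human
# "consistency" ratings for five summaries (illustrative values).
metric_scores = [0.42, 0.31, 0.55, 0.28, 0.47]
human_ratings = [3.5, 2.8, 4.2, 2.5, 3.9]
print(round(spearman(metric_scores, human_ratings), 3))  # → 1.0
```

A metric whose scores rank summaries in the same order as human raters gets a rho near 1; a metric that ranks them randomly gets a rho near 0, which is how poorly correlated metrics are exposed.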