Dialogue is notoriously hard to evaluate. Past approaches have relied primarily on human evaluation.

Leaderboards

You can find evaluation results in the subtasks. You can also submit evaluation metrics for this task.