MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets, plus an additional set of minimal pairs for evaluation.
Source: MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics