BLiMP (Benchmark of Linguistic Minimal Pairs)

Introduced by Warstadt et al. in BLiMP: The Benchmark of Linguistic Minimal Pairs for English

BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. It consists of 67 sub-datasets, each containing 1,000 minimal pairs that isolate a specific contrast in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars. Aggregate human agreement with the labels is 96.4%.
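As a rough illustration of how a minimal-pair benchmark of this kind is typically scored, the sketch below counts a pair as correct when the model scores the acceptable sentence strictly higher than the unacceptable one. The field names `sentence_good` and `sentence_bad` are assumed here, and `toy_score` is a hypothetical stand-in for a real LM log-probability function; neither is taken from the description above.

```python
from typing import Callable, Iterable, Mapping


def minimal_pair_accuracy(
    pairs: Iterable[Mapping[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of pairs where the acceptable sentence gets the higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(
        1
        for pair in pairs
        # Assumed field names; ties count as incorrect.
        if score(pair["sentence_good"]) > score(pair["sentence_bad"])
    )
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    # Hypothetical scorer for illustration only: it penalises a single
    # agreement error and knows nothing else. A real evaluation would
    # return the sentence log-probability under the LM being tested.
    return -1.0 if "cats annoys" in sentence else 0.0


# Two hand-written example pairs (not drawn from the dataset).
pairs = [
    {"sentence_good": "The cats annoy Tim.", "sentence_bad": "The cats annoys Tim."},
    {"sentence_good": "The cat annoys Tim.", "sentence_bad": "The cat annoy Tim."},
]

accuracy = minimal_pair_accuracy(pairs, toy_score)
print(f"accuracy = {accuracy:.2f}")
```

With a real model, the same function could be applied to each of the 67 sub-datasets separately, giving one accuracy per grammatical contrast.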

Source: BLiMP
