How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG

Recent studies have significantly improved the state-of-the-art on common-sense reasoning (CSR) benchmarks like the Winograd Schema Challenge (WSC) and SWAG. The question we ask in this paper is whether improved performance on these benchmarks represents genuine progress towards common-sense-enabled systems. We make case studies of both benchmarks and design protocols that clarify and qualify the results of previous work by analyzing threats to the validity of previous experimental designs. Our protocols account for several properties prevalent in common-sense benchmarks including size limitations, structural regularities, and variable instance difficulty.

IJCNLP 2019

Datasets


Task | Dataset | Model | Metric | Value | Global Rank
Coreference Resolution | Winograd Schema Challenge | GPT-2 Medium 774M (partial scoring) | Accuracy | 69.2 | #36
Coreference Resolution | Winograd Schema Challenge | GPT-2 Medium 774M (full scoring) | Accuracy | 64.5 | #43
Coreference Resolution | Winograd Schema Challenge | GPT-2 Small 117M (partial scoring) | Accuracy | 61.5 | #54
Coreference Resolution | Winograd Schema Challenge | GPT-2 Small 117M (full scoring) | Accuracy | 55.7 | #69
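The "full scoring" and "partial scoring" settings in the table refer to the two language-model scoring schemes commonly used for the Winograd Schema Challenge: full scoring compares the likelihood of the entire sentence with each candidate substituted for the ambiguous pronoun, while partial scoring compares only the likelihood of the tokens after the substitution point, conditioned on the substituted prefix. The sketch below illustrates both schemes with a Hugging Face GPT-2 checkpoint; the helper names, the candidate-substitution format, and the example sentence are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of full vs. partial LM scoring for a Winograd schema.
# Assumes the Hugging Face `transformers` GPT-2 implementation; the
# substitution format and scoring split are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")       # small GPT-2 checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_logprobs(text):
    """Per-token log-probabilities of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so position i scores the token at position i + 1.
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    return logp.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

def full_score(prefix, candidate, suffix):
    """Full scoring: log-probability of the whole substituted sentence."""
    return token_logprobs(f"{prefix} {candidate} {suffix}").sum().item()

def partial_score(prefix, candidate, suffix):
    """Partial scoring: log-probability of the suffix only, conditioned on
    the prefix with the candidate substituted for the pronoun."""
    suffix_ids = tokenizer(f" {suffix}").input_ids
    scores = token_logprobs(f"{prefix} {candidate} {suffix}")
    return scores[-len(suffix_ids):].sum().item()

# Example schema: "The trophy doesn't fit in the suitcase because it is too big."
prefix, suffix = "The trophy doesn't fit in the suitcase because", "is too big."
for cand in ("the trophy", "the suitcase"):
    print(cand, full_score(prefix, cand, suffix), partial_score(prefix, cand, suffix))
```

Under either scheme the candidate with the higher score is taken as the predicted referent; the accuracies above indicate that partial scoring is the stronger variant for both model sizes reported here.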

Methods


No methods listed for this paper.