How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG

Recent studies have significantly improved the state-of-the-art on common-sense reasoning (CSR) benchmarks like the Winograd Schema Challenge (WSC) and SWAG. The question we ask in this paper is whether improved performance on these benchmarks represents genuine progress towards common-sense-enabled systems. We make case studies of both benchmarks and design protocols that clarify and qualify the results of previous work by analyzing threats to the validity of previous experimental designs. Our protocols account for several properties prevalent in common-sense benchmarks including size limitations, structural regularities, and variable instance difficulty.

IJCNLP 2019

Datasets


Task | Dataset | Model | Metric | Value | Global Rank
Coreference Resolution | Winograd Schema Challenge | GPT-2 Medium 774M (partial scoring) | Accuracy | 69.2 | #36
Coreference Resolution | Winograd Schema Challenge | GPT-2 Medium 774M (full scoring) | Accuracy | 64.5 | #43
Coreference Resolution | Winograd Schema Challenge | GPT-2 Small 117M (partial scoring) | Accuracy | 61.5 | #54
Coreference Resolution | Winograd Schema Challenge | GPT-2 Small 117M (full scoring) | Accuracy | 55.7 | #69
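The "full scoring" and "partial scoring" settings in the table refer to the two language-model scoring schemes commonly used for the Winograd Schema Challenge: full scoring compares the likelihood of the entire sentence with each candidate substituted for the ambiguous pronoun, while partial scoring compares only the likelihood of the tokens after the substitution point, conditioned on the substituted prefix. The sketch below illustrates both schemes with a Hugging Face GPT-2 checkpoint; the helper names, the candidate-substitution format, and the example sentence are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of full vs. partial LM scoring for a Winograd schema.
# Assumes the Hugging Face `transformers` GPT-2 implementation; the
# substitution format and scoring split are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")       # small GPT-2 checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_logprobs(text):
    """Per-token log-probabilities of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so position i scores the token at position i + 1.
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    return logp.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

def full_score(prefix, candidate, suffix):
    """Full scoring: log-probability of the whole substituted sentence."""
    return token_logprobs(f"{prefix} {candidate} {suffix}").sum().item()

def partial_score(prefix, candidate, suffix):
    """Partial scoring: log-probability of the suffix only, conditioned on
    the prefix with the candidate substituted for the pronoun."""
    suffix_ids = tokenizer(f" {suffix}").input_ids
    scores = token_logprobs(f"{prefix} {candidate} {suffix}")
    return scores[-len(suffix_ids):].sum().item()

# Example schema: "The trophy doesn't fit in the suitcase because it is too big."
prefix, suffix = "The trophy doesn't fit in the suitcase because", "is too big."
for cand in ("the trophy", "the suitcase"):
    print(cand, full_score(prefix, cand, suffix), partial_score(prefix, cand, suffix))
```

Under either scheme the candidate with the higher score is taken as the predicted referent; the accuracies above indicate that partial scoring is the stronger variant for both model sizes reported here.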

Methods


No methods listed for this paper.