Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema

The Winograd Schema (WS) has been proposed as a test of models' commonsense capabilities. Recently, approaches based on pre-trained language models have boosted performance on some WS benchmarks, but the source of the improvement remains unclear. This paper suggests that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning. To support this claim, we first show that the current WS evaluation method is sub-optimal and propose a modification that evaluates twin sentences jointly. We also propose two new baselines that indicate the existence of artifacts in WS benchmarks. We then develop a method for evaluating WS-like sentences in a zero-shot setting, to account for the commonsense reasoning abilities acquired during pretraining, and observe that popular language models perform at random in this setting under our stricter evaluation. We conclude that the observed progress is mostly due to the use of supervision in training WS models, which is unlikely to impart all the commonsense reasoning skills and knowledge required.
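The twin-based evaluation proposed in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code: the `twin_id` and `label` fields and the binary-choice setup are assumptions about the data format.

```python
# Sketch of twin-pair ("paired") scoring for Winograd-style benchmarks.
# Each example is assumed to carry a twin_id linking the two members of
# a Winograd twin; field names here are illustrative.
from collections import defaultdict

def individual_accuracy(examples, predictions):
    """Standard per-sentence accuracy used by most WS leaderboards."""
    correct = sum(p == ex["label"] for ex, p in zip(examples, predictions))
    return correct / len(examples)

def paired_accuracy(examples, predictions):
    """Stricter twin evaluation: a pair counts as correct only if BOTH
    twins are resolved correctly, so a model that exploits a
    single-sentence artifact and answers both twins the same way
    gains nothing."""
    by_pair = defaultdict(list)
    for ex, p in zip(examples, predictions):
        by_pair[ex["twin_id"]].append(p == ex["label"])
    return sum(all(hits) for hits in by_pair.values()) / len(by_pair)
```

Note that under paired scoring a random guesser on binary twins scores about 25% (0.5 × 0.5 per pair) rather than 50%, so the chance baseline tightens as well.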

EMNLP 2021
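The zero-shot setting mentioned in the abstract can be illustrated with the common LM-scoring recipe: substitute each candidate referent for the ambiguous pronoun and prefer the substitution the language model finds more likely. Below is a minimal sketch using GPT-2; the paper evaluates masked language models and its exact scoring may differ, so treat this as the generic recipe rather than the authors' method.

```python
# Zero-shot Winograd resolution by language-model likelihood.
# A sketch of the generic recipe, not the paper's exact procedure.
import re
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_nll(text: str) -> float:
    """Mean negative log-likelihood of the text under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the LM loss.
        return model(ids, labels=ids).loss.item()

def resolve(sentence: str, pronoun: str, candidates: list[str]) -> str:
    """Substitute each candidate for the first pronoun occurrence and
    return the candidate yielding the lower-loss (more likely) sentence."""
    pattern = rf"\b{re.escape(pronoun)}\b"  # word boundary avoids matching "fit"
    scores = {c: sentence_nll(re.sub(pattern, c, sentence, count=1))
              for c in candidates}
    return min(scores, key=scores.get)

print(resolve(
    "The trophy didn't fit in the suitcase because it was too big.",
    "it",
    ["the trophy", "the suitcase"],
))  # a commonsense-capable LM should prefer "the trophy"
```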

Datasets

Winograd Schema Challenge, WinoGrande

Results


| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Coreference Resolution | Winograd Schema Challenge | Random chance baseline | Accuracy | 50 | #77 |
| Coreference Resolution | Winograd Schema Challenge | ALBERT-xxlarge (235M) | Accuracy | 78.8 | #24 |
| Coreference Resolution | Winograd Schema Challenge | ALBERT-base (11M) | Accuracy | 55.4 | #70 |
| Coreference Resolution | Winograd Schema Challenge | RoBERTa-large (354M) | Accuracy | 73.9 | #28 |
| Coreference Resolution | Winograd Schema Challenge | RoBERTa-base (125M) | Accuracy | 63.0 | #47 |
| Coreference Resolution | Winograd Schema Challenge | BERT-large (340M) | Accuracy | 61.4 | #56 |
| Coreference Resolution | Winograd Schema Challenge | BERT-base (110M) | Accuracy | 56.5 | #68 |
| Common Sense Reasoning | WinoGrande | ALBERT-xxlarge (235M) | Accuracy | 58.7 | #50 |
| Common Sense Reasoning | WinoGrande | BERT-large (345M) | Accuracy | 55.6 | #58 |
| Common Sense Reasoning | WinoGrande | ALBERT-base (11M) | Accuracy | 52.8 | #66 |
| Common Sense Reasoning | WinoGrande | Random baseline | Accuracy | 50 | #72 |
| Common Sense Reasoning | WinoGrande | RoBERTa-large (355M) | Accuracy | 54.9 | #61 |
| Common Sense Reasoning | WinoGrande | BERT-base (110M) | Accuracy | 53.1 | #65 |
| Common Sense Reasoning | WinoGrande | RoBERTa-base (125M) | Accuracy | 56.3 | #55 |

Methods


No methods listed for this paper.