Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks

3 May 2021  ·  Tatyana Iazykova, Denis Kapelyushnik, Olga Bystrova, Andrey Kutuzov

Leaderboards like SuperGLUE are seen as important incentives for the active development of NLP, since they provide standard benchmarks for the fair comparison of modern language models. They have driven the world's best engineering teams and their resources to collaborate on solving a shared set of general language understanding tasks. The performance scores of these models are often claimed to be close to or even above human performance. Such results have encouraged closer analysis of whether the benchmark datasets contain statistical cues that machine-learning-based language models can exploit. For English datasets, it has been shown that they often contain annotation artifacts, which makes it possible to solve certain tasks with very simple rules while achieving competitive rankings. In this paper, a similar analysis is carried out for Russian SuperGLUE (RSG), a recently published benchmark and leaderboard for Russian natural language understanding. We show that its test datasets are vulnerable to shallow heuristics: approaches based on simple rules often outperform or come close to the results of well-known pre-trained language models such as GPT-3 and BERT. The simplest explanation is that a significant part of the SOTA models' performance on the RSG leaderboard comes from exploiting these shallow heuristics and has little to do with real language understanding. We provide a set of recommendations on how to improve these datasets, making the RSG leaderboard even more representative of real progress in Russian NLU.
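Two of the baselines reported in the results table below, "Random weighted" and "majority_class", are trivial to reproduce. The following is a minimal Python sketch of how such baselines are typically computed (an illustration, not the authors' code); the label names and training data here are hypothetical placeholders.

```python
# Minimal sketch of two trivial baselines: "majority_class" always predicts the
# most frequent training label, while "Random weighted" samples labels according
# to their training-set frequencies. Labels and data are hypothetical placeholders.
import random
from collections import Counter

train_labels = ["entailment", "not_entailment", "entailment", "entailment"]

def majority_class_baseline(train_labels, n_test):
    """Predict the single most frequent training label for every test item."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

def random_weighted_baseline(train_labels, n_test, seed=42):
    """Sample each prediction from the empirical training label distribution."""
    rng = random.Random(seed)
    labels, weights = zip(*Counter(train_labels).items())
    return rng.choices(labels, weights=weights, k=n_test)

print(majority_class_baseline(train_labels, 3))   # ['entailment', 'entailment', 'entailment']
print(random_weighted_baseline(train_labels, 3))  # e.g. ['entailment', 'not_entailment', 'entailment']
```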


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | DaNetQA | Random weighted | Accuracy | 0.52 | #21 |
| Question Answering | DaNetQA | majority_class | Accuracy | 0.503 | #22 |
| Question Answering | DaNetQA | heuristic majority | Accuracy | 0.642 | #11 |
| Natural Language Inference | LiDiRus | majority_class | MCC | 0 | #19 |
| Natural Language Inference | LiDiRus | heuristic majority | MCC | 0.147 | #13 |
| Natural Language Inference | LiDiRus | Random weighted | MCC | 0 | #19 |
| Reading Comprehension | MuSeRC | majority_class | Average F1 | 0.0 | #22 |
| Reading Comprehension | MuSeRC | majority_class | EM | 0.0 | #22 |
| Reading Comprehension | MuSeRC | Random weighted | Average F1 | 0.45 | #21 |
| Reading Comprehension | MuSeRC | Random weighted | EM | 0.071 | #21 |
| Reading Comprehension | MuSeRC | heuristic majority | Average F1 | 0.671 | #15 |
| Reading Comprehension | MuSeRC | heuristic majority | EM | 0.237 | #19 |
| Common Sense Reasoning | PARus | Random weighted | Accuracy | 0.48 | #20 |
| Common Sense Reasoning | PARus | majority_class | Accuracy | 0.498 | #15 |
| Common Sense Reasoning | PARus | heuristic majority | Accuracy | 0.478 | #21 |
| Natural Language Inference | RCB | majority_class | Average F1 | 0.217 | #22 |
| Natural Language Inference | RCB | majority_class | Accuracy | 0.484 | #8 |
| Natural Language Inference | RCB | Random weighted | Average F1 | 0.319 | #17 |
| Natural Language Inference | RCB | Random weighted | Accuracy | 0.374 | #22 |
| Natural Language Inference | RCB | heuristic majority | Average F1 | 0.4 | #6 |
| Natural Language Inference | RCB | heuristic majority | Accuracy | 0.438 | #20 |
| Common Sense Reasoning | RuCoS | Random weighted | Average F1 | 0.25 | #17 |
| Common Sense Reasoning | RuCoS | Random weighted | EM | 0.247 | #17 |
| Common Sense Reasoning | RuCoS | majority_class | Average F1 | 0.25 | #17 |
| Common Sense Reasoning | RuCoS | majority_class | EM | 0.247 | #17 |
| Common Sense Reasoning | RuCoS | heuristic majority | Average F1 | 0.26 | #15 |
| Common Sense Reasoning | RuCoS | heuristic majority | EM | 0.257 | #15 |
| Word Sense Disambiguation | RUSSE | Random weighted | Accuracy | 0.528 | #22 |
| Word Sense Disambiguation | RUSSE | heuristic majority | Accuracy | 0.595 | #15 |
| Word Sense Disambiguation | RUSSE | majority_class | Accuracy | 0.587 | #16 |
| Common Sense Reasoning | RWSD | Random weighted | Accuracy | 0.597 | #3 |
| Common Sense Reasoning | RWSD | majority_class | Accuracy | 0.669 | #8 |
| Common Sense Reasoning | RWSD | heuristic majority | Accuracy | 0.669 | #8 |
| Natural Language Inference | TERRa | heuristic majority | Accuracy | 0.549 | #17 |
| Natural Language Inference | TERRa | majority_class | Accuracy | 0.513 | #18 |
| Natural Language Inference | TERRa | Random weighted | Accuracy | 0.483 | #21 |
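For intuition on what a "heuristic majority"-style baseline can look like on the NLI-type tasks (TERRa, RCB, LiDiRus), below is a minimal sketch of one possible shallow rule: predict entailment whenever most hypothesis tokens also occur in the premise. This specific rule and its threshold are illustrative assumptions, not the exact heuristics evaluated in the paper.

```python
# Sketch of a shallow, rule-based NLI heuristic (illustrative assumption only):
# lexical overlap between premise and hypothesis decides the label.
def overlap_heuristic(premise: str, hypothesis: str, threshold: float = 0.8) -> str:
    """Predict 'entailment' when most hypothesis tokens also occur in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return "not_entailment"
    overlap = sum(tok in premise_tokens for tok in hypothesis_tokens) / len(hypothesis_tokens)
    return "entailment" if overlap >= threshold else "not_entailment"

print(overlap_heuristic("кошка спит на диване", "кошка спит"))   # 'entailment'
print(overlap_heuristic("кошка спит на диване", "собака лает"))  # 'not_entailment'
```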

Methods