Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks

3 May 2021  ·  Tatyana Iazykova, Denis Kapelyushnik, Olga Bystrova, Andrey Kutuzov

Leaderboards like SuperGLUE are seen as important incentives for the active development of NLP, since they provide standard benchmarks for the fair comparison of modern language models. They have driven the world's best engineering teams and their resources to collaborate on solving a shared set of general language understanding tasks. The performance scores of these models are often claimed to be close to or even above human performance. Such results have encouraged closer analysis of whether the benchmark datasets contain statistical cues that machine-learning-based language models can exploit. For English datasets, it has been shown that they often contain annotation artifacts, which makes it possible to solve certain tasks with very simple rules while achieving competitive rankings. In this paper, a similar analysis is carried out for Russian SuperGLUE (RSG), a recently published benchmark and leaderboard for Russian natural language understanding. We show that its test datasets are vulnerable to shallow heuristics: approaches based on simple rules often outperform or come close to the results of well-known pre-trained language models such as GPT-3 and BERT. The simplest explanation is that a significant part of the SOTA models' performance on the RSG leaderboard comes from exploiting these shallow heuristics and has little to do with real language understanding. We provide a set of recommendations on how to improve these datasets, making the RSG leaderboard even more representative of real progress in Russian NLU.
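Two of the baselines reported in the results table below, "Random weighted" and "majority_class", are trivial to reproduce. The following is a minimal Python sketch of how such baselines are typically computed (an illustration, not the authors' code); the label names and training data here are hypothetical placeholders.

```python
# Minimal sketch of two trivial baselines: "majority_class" always predicts the
# most frequent training label, while "Random weighted" samples labels according
# to their training-set frequencies. Labels and data are hypothetical placeholders.
import random
from collections import Counter

train_labels = ["entailment", "not_entailment", "entailment", "entailment"]

def majority_class_baseline(train_labels, n_test):
    """Predict the single most frequent training label for every test item."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

def random_weighted_baseline(train_labels, n_test, seed=42):
    """Sample each prediction from the empirical training label distribution."""
    rng = random.Random(seed)
    labels, weights = zip(*Counter(train_labels).items())
    return rng.choices(labels, weights=weights, k=n_test)

print(majority_class_baseline(train_labels, 3))   # ['entailment', 'entailment', 'entailment']
print(random_weighted_baseline(train_labels, 3))  # e.g. ['entailment', 'not_entailment', 'entailment']
```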


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | DaNetQA | Random weighted | Accuracy | 0.52 | #21 |
| Question Answering | DaNetQA | majority_class | Accuracy | 0.503 | #22 |
| Question Answering | DaNetQA | heuristic majority | Accuracy | 0.642 | #11 |
| Natural Language Inference | LiDiRus | majority_class | MCC | 0 | #19 |
| Natural Language Inference | LiDiRus | heuristic majority | MCC | 0.147 | #13 |
| Natural Language Inference | LiDiRus | Random weighted | MCC | 0 | #19 |
| Reading Comprehension | MuSeRC | majority_class | Average F1 | 0.0 | #22 |
| Reading Comprehension | MuSeRC | majority_class | EM | 0.0 | #22 |
| Reading Comprehension | MuSeRC | Random weighted | Average F1 | 0.45 | #21 |
| Reading Comprehension | MuSeRC | Random weighted | EM | 0.071 | #21 |
| Reading Comprehension | MuSeRC | heuristic majority | Average F1 | 0.671 | #15 |
| Reading Comprehension | MuSeRC | heuristic majority | EM | 0.237 | #19 |
| Common Sense Reasoning | PARus | Random weighted | Accuracy | 0.48 | #20 |
| Common Sense Reasoning | PARus | majority_class | Accuracy | 0.498 | #15 |
| Common Sense Reasoning | PARus | heuristic majority | Accuracy | 0.478 | #21 |
| Natural Language Inference | RCB | majority_class | Average F1 | 0.217 | #22 |
| Natural Language Inference | RCB | majority_class | Accuracy | 0.484 | #8 |
| Natural Language Inference | RCB | Random weighted | Average F1 | 0.319 | #17 |
| Natural Language Inference | RCB | Random weighted | Accuracy | 0.374 | #22 |
| Natural Language Inference | RCB | heuristic majority | Average F1 | 0.4 | #6 |
| Natural Language Inference | RCB | heuristic majority | Accuracy | 0.438 | #20 |
| Common Sense Reasoning | RuCoS | Random weighted | Average F1 | 0.25 | #17 |
| Common Sense Reasoning | RuCoS | Random weighted | EM | 0.247 | #17 |
| Common Sense Reasoning | RuCoS | majority_class | Average F1 | 0.25 | #17 |
| Common Sense Reasoning | RuCoS | majority_class | EM | 0.247 | #17 |
| Common Sense Reasoning | RuCoS | heuristic majority | Average F1 | 0.26 | #15 |
| Common Sense Reasoning | RuCoS | heuristic majority | EM | 0.257 | #15 |
| Word Sense Disambiguation | RUSSE | Random weighted | Accuracy | 0.528 | #22 |
| Word Sense Disambiguation | RUSSE | heuristic majority | Accuracy | 0.595 | #15 |
| Word Sense Disambiguation | RUSSE | majority_class | Accuracy | 0.587 | #16 |
| Common Sense Reasoning | RWSD | Random weighted | Accuracy | 0.597 | #3 |
| Common Sense Reasoning | RWSD | majority_class | Accuracy | 0.669 | #8 |
| Common Sense Reasoning | RWSD | heuristic majority | Accuracy | 0.669 | #8 |
| Natural Language Inference | TERRa | heuristic majority | Accuracy | 0.549 | #17 |
| Natural Language Inference | TERRa | majority_class | Accuracy | 0.513 | #18 |
| Natural Language Inference | TERRa | Random weighted | Accuracy | 0.483 | #21 |
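For intuition on what a "heuristic majority"-style baseline can look like on the NLI-type tasks (TERRa, RCB, LiDiRus), below is a minimal sketch of one possible shallow rule: predict entailment whenever most hypothesis tokens also occur in the premise. This specific rule and its threshold are illustrative assumptions, not the exact heuristics evaluated in the paper.

```python
# Sketch of a shallow, rule-based NLI heuristic (illustrative assumption only):
# lexical overlap between premise and hypothesis decides the label.
def overlap_heuristic(premise: str, hypothesis: str, threshold: float = 0.8) -> str:
    """Predict 'entailment' when most hypothesis tokens also occur in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return "not_entailment"
    overlap = sum(tok in premise_tokens for tok in hypothesis_tokens) / len(hypothesis_tokens)
    return "entailment" if overlap >= threshold else "not_entailment"

print(overlap_heuristic("кошка спит на диване", "кошка спит"))   # 'entailment'
print(overlap_heuristic("кошка спит на диване", "собака лает"))  # 'not_entailment'
```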

Methods