Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical applications. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this gap, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark of six complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistically motivated adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations degrade performance the most, while paraphrasing the input has a far smaller effect. At the same time, the results demonstrate a significant gap between the neural and human baselines on most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
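
A minimal illustration of the kind of spelling-based perturbation the robustness finding refers to is sketched below. The function name and logic are hypothetical, not TAPE's actual implementation; they only show how character-level noise can be injected while keeping the text readable.

```python
import random

# Illustrative sketch, not TAPE's actual perturbation code: randomly swap
# adjacent in-word characters to simulate simple spelling noise (typos).
def swap_adjacent_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# str.isalpha() is Unicode-aware, so this works for Cyrillic input as well.
print(swap_adjacent_chars("Который по счёту город Москва среди крупнейших городов Европы?"))
```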

Results from the Paper


Task                Dataset              Model            Metric Name  Metric Value  Global Rank
Question Answering  CheGeKa              Human benchmark  Accuracy     64.5          # 1
Question Answering  CheGeKa              RuGPT-3 Small    Accuracy     0.0           # 2
Question Answering  CheGeKa              RuGPT-3 Medium   Accuracy     0.0           # 2
Question Answering  CheGeKa              RuGPT-3 Large    Accuracy     0.0           # 2
Ethics              Ethics               RuGPT-3 Large    Accuracy     68.6          # 1
Ethics              Ethics               RuGPT-3 Medium   Accuracy     68.3          # 2
Ethics              Ethics               RuGPT-3 Small    Accuracy     55.5          # 3
Ethics              Ethics               Human benchmark  Accuracy     52.9          # 4
Ethics              Ethics (per ethics)  Human benchmark  Accuracy     67.6          # 1
Ethics              Ethics (per ethics)  RuGPT-3 Small    Accuracy     60.9          # 2
Ethics              Ethics (per ethics)  RuGPT-3 Large    Accuracy     44.9          # 3
Ethics              Ethics (per ethics)  RuGPT-3 Medium   Accuracy     44.1          # 4
Question Answering  MultiQ               Human benchmark  Accuracy     91.0          # 1
Question Answering  MultiQ               RuGPT-3 Small    Accuracy     0.0           # 2
Question Answering  MultiQ               RuGPT-3 Medium   Accuracy     0.0           # 2
Question Answering  MultiQ               RuGPT-3 Large    Accuracy     0.0           # 2
Question Answering  RuOpenBookQA         Human benchmark  Accuracy     86.5          # 1
Question Answering  RuOpenBookQA         RuGPT-3 Small    Accuracy     57.9          # 2
Question Answering  RuOpenBookQA         RuGPT-3 Medium   Accuracy     57.2          # 3
Question Answering  RuOpenBookQA         RuGPT-3 Large    Accuracy     55.5          # 4
Logical Reasoning   RuWorldTree          Human benchmark  Accuracy     83.7          # 1
Logical Reasoning   RuWorldTree          RuGPT-3 Large    Accuracy     40.7          # 2
Logical Reasoning   RuWorldTree          RuGPT-3 Medium   Accuracy     38.0          # 3
Logical Reasoning   RuWorldTree          RuGPT-3 Small    Accuracy     34.0          # 4
Logical Reasoning   Winograd Automatic   Human benchmark  Accuracy     87.0          # 1
Logical Reasoning   Winograd Automatic   RuGPT-3 Small    Accuracy     57.9          # 2
Logical Reasoning   Winograd Automatic   RuGPT-3 Medium   Accuracy     57.2          # 3
Logical Reasoning   Winograd Automatic   RuGPT-3 Large    Accuracy     55.5          # 4
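
For the multiple-choice tasks above (e.g. RuOpenBookQA, RuWorldTree, Winograd Automatic), accuracy for an autoregressive baseline is commonly obtained by log-likelihood scoring of each answer option. The sketch below is a minimal illustration of that technique, assuming the publicly available ai-forever/rugpt3small_based_on_gpt2 checkpoint; it is not TAPE's evaluation harness, and the prompt format is hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; TAPE's actual evaluation setup may differ.
MODEL = "ai-forever/rugpt3small_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of per-token log-probabilities of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t+1, so shift targets by one.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

def pick_answer(question: str, options: list[str]) -> str:
    """Zero-shot multiple choice: the highest-likelihood continuation wins."""
    return max(options, key=lambda opt: sequence_logprob(f"{question} {opt}"))
```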

Methods


No methods listed for this paper.