HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy).
437 PAPERS • 1 BENCHMARK
…Results show that state-of-the-art neural models perform far worse than the human ceiling. The dataset can also serve as a benchmark for reinvestigating logical AI under the deep learning NLP setting.
71 PAPERS • NO BENCHMARKS YET
…Incorporating state-of-the-art definition generation models, it supports not only Chinese and English, but also Chinese-English cross-lingual queries.
1 PAPER • NO BENCHMARKS YET
…Biology, Astronomy, Geology, Computer Science, Engineering, Environmental Science, Neuroscience, Robotics | | History and Culture | Ancient History, Medieval History, Modern History, World History, Art
6 PAPERS • 1 BENCHMARK