…Results show that state-of-the-art neural models perform far worse than the human ceiling. The dataset can also serve as a benchmark for reinvestigating logical AI under the deep learning NLP setting.
71 PAPERS • NO BENCHMARKS YET
…The adversarial human annotation paradigm ensures that these datasets consist of questions that current state-of-the-art models (at least the ones used as adversaries in the annotation loop) find challenging.
24 PAPERS • 2 BENCHMARKS
…While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models.
19 PAPERS • NO BENCHMARKS YET