…Results show that state-of-the-art neural models perform far worse than the human ceiling. The dataset can also serve as a benchmark for reinvestigating logical AI under the deep learning NLP setting.
70 PAPERS • 1 BENCHMARK
…While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models.
17 PAPERS • NO BENCHMARKS YET