The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks: the single-sentence tasks CoLA and SST-2, the similarity and paraphrasing tasks MRPC, STS-B, and QQP, and the natural language inference tasks MNLI, QNLI, RTE, and WNLI.
2,200 PAPERS • 57 BENCHMARKS
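For quick experimentation, the individual GLUE tasks are commonly pulled from the Hugging Face hub; a minimal sketch, assuming the `datasets` library and its `glue` configurations:

```python
from datasets import load_dataset

# Load one GLUE task at a time; config names follow the task acronyms
# ("cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", "wnli").
cola = load_dataset("glue", "cola")

# Each task ships with train/validation/test splits; test-set labels
# are withheld for leaderboard submission.
print(cola["train"][0])  # e.g. {'sentence': ..., 'label': 1, 'idx': 0}
```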
The Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, each annotated with a binary value indicating whether the two questions are paraphrases of each other.
50 PAPERS • 6 BENCHMARKS
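Because QQP is one of the GLUE tasks, the same library exposes its pair-plus-label structure directly; a minimal sketch (the field names below reflect the Hugging Face copy and are an assumption about the distribution you use):

```python
from datasets import load_dataset

qqp = load_dataset("glue", "qqp")

# Each example is a question pair with a binary paraphrase label:
# label == 1 means the two questions paraphrase each other.
ex = qqp["train"][0]
print(ex["question1"], ex["question2"], ex["label"], sep="\n")
```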
AmazonQA consists of 923k questions, 3.6M answers, and 14M reviews across 156k products. Building on the well-known Amazon dataset, AmazonQA adds annotations marking each question as either answerable or unanswerable based on the available reviews.
13 PAPERS • NO BENCHMARKS YET
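To make the answerability annotation concrete, the following is a purely hypothetical sketch of filtering answerable questions from a JSON-lines dump; the file name and every field name (`is_answerable`, `question`, `reviews`) are illustrative assumptions, not the official AmazonQA schema:

```python
import json

# Iterate over a (hypothetical) JSONL export and keep only questions
# marked answerable from the available reviews.
with open("amazonqa_train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if record.get("is_answerable"):  # assumed answerability flag
            print(record["question"], len(record.get("reviews", [])))
```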
ANTIQUE is a collection of 2,626 open-domain non-factoid questions from a diverse set of categories, with 34,011 manual relevance annotations. The questions were asked by real users of a community question answering service, Yahoo! Answers, and relevance judgments for all the answers to each question were collected through crowdsourcing.
11 PAPERS • NO BENCHMARKS YET
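ANTIQUE is available through the `ir_datasets` package, among other channels; a minimal sketch, assuming the `antique/train` identifier used there:

```python
import ir_datasets

# Train split: non-factoid questions, answer passages, and
# crowdsourced relevance judgments (qrels).
dataset = ir_datasets.load("antique/train")

query = next(dataset.queries_iter())
print(query.query_id, query.text)

qrel = next(dataset.qrels_iter())
print(qrel.query_id, qrel.doc_id, qrel.relevance)
```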
CQASUMM is a dataset for CQA (Community Question Answering) summarization, constructed from the Yahoo! Answers L6 dataset of 4.4 million question threads. It contains ~300k annotated samples.
8 PAPERS • NO BENCHMARKS YET
PerCQA is the first Persian dataset for CQA (Community Question Answering). It contains questions and answers crawled from the best-known Persian forum.
1 PAPER • NO BENCHMARKS YET