RGRS is a dataset for collaborator recommendation on the ResearchGate academic social network. The data was collected from January to April 2019 and includes raw data for 3,980 RG users.
1 PAPER • NO BENCHMARKS YET
SCIMAT is a large question-answer dataset for mathematics and science problems; such a dataset can support online education, intelligent tutoring, and automated grading.
A curated QA benchmark on the 2023 State of the Union Address. It contains curated questions and answers based on knowledge presented in the 2023 State of the Union Address (delivered in February). It is especially useful for examining the ability of tool-augmented LMs (ALMs) to answer questions over a private document.
A Spanish translation of the Stanford Question Answering Dataset (SQuAD).
SQuAD-it is derived from the SQuAD dataset through semi-automatic translation into Italian. It is a large-scale dataset for open-domain question answering on factoid questions in Italian, containing more than 60,000 question/answer pairs derived from the original English dataset.
1 PAPER • 1 BENCHMARK
The corpus was built from the “Mental Health” forum, a forum dedicated to people suffering from schizophrenia and other mental disorders. Relevant posts by active users who participate regularly were extracted, providing a new method of obtaining low-bias content without privacy issues. The corpus is then processed into a SQuAD (Stanford Question Answering Dataset) version in order to train an ML QA model.
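To illustrate what a SQuAD-style conversion of a forum post looks like, here is a minimal sketch following the public SQuAD v1.1 JSON layout (`data` → `paragraphs` → `qas` → `answers` with character-level `answer_start`). The post text, question, and identifiers are invented placeholders, not taken from the actual corpus.

```python
import json

# Hypothetical forum post serving as the SQuAD "context".
post_text = "I was diagnosed two years ago and the medication helped a lot."

# One entry in SQuAD v1.1 format: the answer is a span of the context,
# located by its character offset (answer_start).
squad_entry = {
    "data": [{
        "title": "mental_health_forum",
        "paragraphs": [{
            "context": post_text,
            "qas": [{
                "id": "post-0001-q0",
                "question": "When was the user diagnosed?",
                "answers": [{
                    "text": "two years ago",
                    "answer_start": post_text.index("two years ago"),
                }],
            }],
        }],
    }]
}

print(json.dumps(squad_entry, indent=2))
```

A file of such entries can be fed directly to standard SQuAD training scripts, which expect exactly this nesting and character-offset convention.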
ScienceExamCER is a collection of resources for studying explanation-centered inference, including explanation graphs for 1,680 questions, with 4,950 tablestore rows, and other analyses of the knowledge required to answer elementary and middle-school science questions.
Recent applications of LLMs in Machine Reading Comprehension (MRC) systems have shown impressive results, but the use of shortcuts, mechanisms triggered by features spuriously correlated with the true label, has emerged as a potential threat to their reliability. We analyze the problem from two angles: LLMs as editors, guided to edit text to mislead LLMs; and LLMs as readers, which answer questions based on the edited text. We introduce a framework that guides an editor to add potential shortcut triggers to samples. Using GPT4 as the editor, we find it can successfully edit shortcut triggers into samples that fool LLMs. Analyzing LLMs as readers, we observe that even capable LLMs can be deceived using shortcut knowledge. Strikingly, we discover that GPT4 can be deceived by its own edits (a 15% drop in F1). Our findings highlight inherent vulnerabilities of LLMs to shortcut manipulations. We publish ShortcutQA, a curated dataset generated by our framework, for future research.
TinySocial is a dataset to enable research on Social Visual Question Answering.
The data consists of a set of 3 task types and 4 question types, creating 12 total scenarios. The tasks are grouped into stories, which are denoted by the numbering at the start of each line.
Nowadays, individuals tend to engage in dialogues with Large Language Models, seeking answers to their questions. In times when such answers are readily accessible to anyone, stimulating and preserving humans' cognitive abilities, and ensuring that humans maintain good reasoning skills, becomes crucial. This study addresses these needs by proposing hints (instead of, or before, final answers) as a viable solution. We introduce a framework for automatic hint generation for factoid questions, employing it to construct TriviaHG, a novel large-scale dataset featuring 160,230 hints corresponding to 16,645 questions from the TriviaQA dataset. Additionally, we present an automatic evaluation method that measures the Convergence and Familiarity quality attributes of hints. To evaluate the TriviaHG dataset and the proposed evaluation method, we enlisted 10 individuals to annotate 2,791 hints and tasked 6 humans with answering questions using the provided hints.
The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in “Answering Complex Questions Using Open Information Extraction” (referred to as Tuple KB, T). These sentences were collected from a large Web corpus using training questions from 4th and 8th grade as queries. The dataset contains 156K sentences collected for 4th-grade questions and 107K sentences for 8th-grade questions. Each sentence is followed by the Open IE v4 tuples in their simple format.
The Visual Discriminative Question Generation (VDQG) dataset contains 11202 ambiguous image pairs collected from Visual Genome. Each image pair is annotated with 4.6 discriminative questions and 5.9 non-discriminative questions on average.
VQA 360° is a dataset for visual question answering on 360° images containing around 17,000 real-world image-question-answer triplets for a variety of question types.
VTQA is a dataset containing open-ended questions about image-text pairs. It requires a model to align multimedia representations of the same entity, perform multi-hop reasoning between image and text, and finally answer the question in natural language. The aim of the dataset is to develop and benchmark models capable of multimedia entity alignment, multi-step reasoning, and open-ended answer generation. VTQA consists of 10,238 image-text pairs and 27,317 questions. The images are real images from the MSCOCO dataset, containing a variety of entities. The annotators were required first to annotate relevant text according to the image, then to ask questions based on the image-text pair, and finally to provide open-ended answers.
Visual Choice of Plausible Alternatives (VCOPA) is an evaluation dataset containing 380 VCOPA questions and over 1K images on various topics. The dataset is amenable to automatic evaluation, and the performance of baseline reasoning approaches is presented as an initial benchmark for future systems.
VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube, an extensive source of user-uploaded content, covering the topics of food and travel in the Vietnamese language. The dataset supports research on Vietnamese spoken-based machine reading comprehension.
A publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.
To collect WikiSuggest, the Google Suggest API is used to harvest natural language questions, which are then submitted to Google Search. Whenever Google Search returns a box with a short answer from Wikipedia, an example is created from the question, the answer, and the Wikipedia document. If the answer string is missing from the document, this often implies a spurious question-answer pair, such as (‘what time is half time in rugby’, ‘80 minutes, 40 minutes’); question-answer pairs without the exact answer string are therefore pruned. Fifty examples were examined after filtering: 54% were well-formed question-answer pairs whose answers could be grounded in the document, 20% contained answers without textual evidence in the document (the answer string exists in an irrelevant context), and 26% were incorrect QA pairs.
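The pruning step described above can be sketched as a simple exact-match filter. The function and field names below are illustrative assumptions, not the actual pipeline code; the first example reproduces the spurious rugby pair from the description.

```python
def prune_pairs(examples):
    """Keep only (question, answer, document) triples whose exact
    answer string occurs verbatim in the document text."""
    return [ex for ex in examples if ex["answer"] in ex["document"]]

examples = [
    # Spurious pair: the concatenated answer string never appears
    # verbatim in the document, so it gets pruned.
    {"question": "what time is half time in rugby",
     "answer": "80 minutes, 40 minutes",
     "document": "A rugby match lasts 80 minutes, with half time after 40."},
    # Grounded pair: the answer string appears in the document.
    {"question": "capital of france",
     "answer": "Paris",
     "document": "Paris is the capital and largest city of France."},
]

kept = prune_pairs(examples)  # only the grounded pair survives
```

As the description notes, this heuristic is imperfect: an answer string can match in an irrelevant context and still pass the filter, which accounts for the 20% of pairs lacking real textual evidence.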
Xamarin Q&A consists of two datasets of questions and answers for studying the development of cross-platform mobile applications using the Xamarin framework. The two datasets were created by mining two Q&A sites: Xamarin Forum and Stack Overflow. The datasets have 85,908 questions mined from the Xamarin Forum and 44,434 from Stack Overflow.
We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose concatenated-bAbI (catbAbI): an infinite sequence of bAbI stories. catbAbI is generated from the bAbI dataset: during training, a random sample/story from any task is drawn without replacement and concatenated to the ongoing story. The preprocessing for catbAbI addresses several issues: it removes the supporting facts, leaves the questions embedded in the story, inserts the correct answer after the question mark, and tokenises the full sample into a single sequence of words. As such, catbAbI is designed to be trained in an autoregressive way, analogous to closed-book question answering.
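The preprocessing steps above can be sketched as follows, under the assumption that raw bAbI lines look like `"1 Mary went to the kitchen."` for statements and `"3 Where is Mary?\tkitchen\t1"` (question, answer, supporting-fact id, tab-separated) for questions. The parsing details are illustrative, not the authors' exact code.

```python
def to_catbabi(lines):
    """Flatten one bAbI story into a single token sequence: drop line
    numbers and supporting facts, keep questions in place, and insert
    the answer right after the question mark."""
    tokens = []
    for line in lines:
        line = line.split(" ", 1)[1]            # drop the leading line number
        if "\t" in line:                        # question line
            question, answer, _support = line.split("\t")
            # keep the question embedded, then append the answer after "?"
            tokens += question.replace("?", " ?").split() + [answer]
        else:                                   # statement line
            tokens += line.rstrip(".").split() + ["."]
    return tokens

story = [
    "1 Mary went to the kitchen.",
    "2 John moved to the garden.",
    "3 Where is Mary?\tkitchen\t1",
]
seq = to_catbabi(story)
# seq is one flat word sequence suitable for autoregressive training
```

Concatenating many such sequences from randomly drawn stories yields the infinite catbAbI stream; a language model is then trained to predict every next token, including the answer tokens that follow each question mark.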
1 PAPER • 2 BENCHMARKS
The simply-CLEVR dataset aims to provide a benchmark dataset that can be used for transparent quantitative evaluation of explanation methods (aka heatmaps/XAI methods). It is made of simple Visual Question Answering (VQA) questions, which are derived from the original CLEVR task, and where each question is accompanied by two Ground Truth Masks that serve as a basis for evaluating explanations on the input image.
The Customer Support on Twitter dataset is a large, modern corpus of tweets and replies intended to aid innovation in natural language understanding and conversational models, and to support the study of modern customer support practices and their impact.
0 PAPER • NO BENCHMARKS YET