Nowadays, individuals tend to engage in dialogues with Large Language Models, seeking answers to their questions. In times when such answers are readily accessible to anyone, stimulating and preserving humans' cognitive abilities and ensuring that humans maintain good reasoning skills become crucial. This study addresses such needs by proposing hints (instead of final answers, or before giving answers) as a viable solution. We introduce a framework for automatic hint generation for factoid questions, employing it to construct TriviaHG, a novel large-scale dataset featuring 160,230 hints corresponding to 16,645 questions from the TriviaQA dataset. Additionally, we present an automatic evaluation method that measures the Convergence and Familiarity quality attributes of hints. To evaluate the TriviaHG dataset and the proposed evaluation method, we enlisted 10 individuals to annotate 2,791 hints and tasked 6 humans with answering questions using the provided hints.
1 PAPER • NO BENCHMARKS YET
Recent advancements in large language models (LLMs) have led to their adoption across various applications, notably in combining LLMs with external content to generate responses. These applications, however, are vulnerable to indirect prompt injection attacks, where malicious instructions embedded within external content compromise the LLMs' outputs, causing responses to deviate from user expectations. Despite the discovery of this security issue, no comprehensive analysis of indirect prompt injection attacks on different LLMs has been available due to the lack of a benchmark, and no effective defense has been proposed. We introduce BIPIA, the first benchmark for indirect prompt injection attacks, to measure the robustness of various LLMs and defenses against such attacks. We hope that our benchmark and defenses can inspire future work in this important area.
Recent applications of LLMs in Machine Reading Comprehension (MRC) systems have shown impressive results, but the use of shortcuts, mechanisms triggered by features spuriously correlated with the true label, has emerged as a potential threat to their reliability. We analyze the problem from two angles: LLMs as editors, guided to edit text to mislead other LLMs; and LLMs as readers, which answer questions based on the edited text. We introduce a framework that guides an editor to add potential shortcut triggers to samples. Using GPT4 as the editor, we find it can successfully insert shortcut triggers into samples that fool LLMs. Analyzing LLMs as readers, we observe that even capable LLMs can be deceived using shortcut knowledge. Strikingly, we discover that GPT4 can be deceived by its own edits (a 15% drop in F1). Our findings highlight inherent vulnerabilities of LLMs to shortcut manipulations. We publish ShortcutQA, a curated dataset generated by our framework, for future research.
Benchmark to evaluate the capability of LMs to consolidate and recall information from multiple training documents.
MathVista is a consolidated benchmark for mathematical reasoning within visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address missing visual domains and are tailored to evaluate logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and complexity of visual perception and mathematical reasoning challenges within the benchmark. In total, MathVista includes 6,141 examples collected from 31 different datasets.
26 PAPERS • NO BENCHMARKS YET
Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
19 PAPERS • NO BENCHMARKS YET
PubChemQA consists of molecules and their corresponding textual descriptions from PubChem. It contains a single type of question, i.e., please describe the molecule. We remove molecules that cannot be processed by RDKit [Landrum et al., 2021] to generate 2D molecular graphs, remove texts with fewer than 4 words, and crop descriptions with more than 256 words. Finally, we obtain 325,754 unique molecules and 365,129 molecule-text pairs. On average, each text description contains 17 words.
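A minimal sketch of the filtering described above, assuming the raw data arrives as (SMILES, description) pairs; the function name and input format are illustrative, not the dataset's release code:

```python
from rdkit import Chem

def filter_molecule_text_pairs(pairs):
    """Apply the PubChemQA filters: parseable molecule, >= 4 words, <= 256 words."""
    kept = []
    for smiles, text in pairs:
        # Drop molecules RDKit cannot parse into a 2D molecular graph.
        if Chem.MolFromSmiles(smiles) is None:
            continue
        words = text.split()
        # Drop descriptions with fewer than 4 words.
        if len(words) < 4:
            continue
        # Crop descriptions with more than 256 words.
        kept.append((smiles, " ".join(words[:256])))
    return kept
```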
3 PAPERS • 1 BENCHMARK
UniProtQA consists of proteins and textual queries about their functions and properties. The dataset is constructed from UniProt and consists of 4 types of questions regarding functions, official names, protein families, and sub-cellular locations. We collect a total of 569,516 proteins and 1,891,506 question-answering samples.
BeaverTails is a dataset aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, the authors have compiled safety meta-labels for 30,207 question-answer (QA) pairs and gathered 30,144 pairs of expert comparison data for both the helpfulness and harmlessness metrics.
9 PAPERS • NO BENCHMARKS YET
ToolQA is a question answering benchmark for Large Language Models (LLMs) which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. The development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions.
7 PAPERS • NO BENCHMARKS YET
CompMix is a crowdsourced QA benchmark which naturally demands the integration of a mixture of input sources. CompMix has a total of 9,410 questions, and features several complex intents like joins and temporal conditions.
BDD-QA is distinguished by its encompassing range of traffic actions, crafted to rigorously evaluate a model's decision-making abilities in traffic scenarios. This makes it a potent tool for high-level decision-making research within traffic contexts, including autonomous driving development.
Curated QA benchmark on the 2023 State of the Union Address. It contains curated questions and answers based on knowledge presented in the 2023 State of the Union Address (delivered in February). It is especially useful for tool-augmented LMs / ALMs, to examine a model's ability to answer questions over private documents.
We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems spanning Math, EE&CS, Physics, and Finance. The dataset was collected by human experts and is of very high quality. We provide it as a new benchmark to test the limits of large language models in applying theorems to solve challenging university-level questions, along with a pipeline to prompt LLMs and evaluate their outputs with WolframAlpha, sketched below.
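A minimal sketch of such a prompt-and-verify pipeline, assuming a hypothetical `ask_llm` helper and using WolframAlpha's Short Answers API (which requires your own app id); the exact-match comparison is a stand-in for the authors' own scoring:

```python
import requests

WOLFRAM_APP_ID = "YOUR_APP_ID"  # assumption: your key from developer.wolframalpha.com

def wolfram_short_answer(query: str) -> str:
    """Fetch a short textual answer from WolframAlpha for cross-checking."""
    resp = requests.get(
        "https://api.wolframalpha.com/v1/result",
        params={"appid": WOLFRAM_APP_ID, "i": query},
        timeout=30,
    )
    return resp.text.strip()

def grade(question: str, llm_answer: str) -> bool:
    """Stand-in grader: compare the LLM's answer to WolframAlpha's result."""
    reference = wolfram_short_answer(question)
    return llm_answer.strip().lower() == reference.lower()

# llm_answer = ask_llm(f"Apply the relevant theorem to answer: {question}")  # hypothetical
```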
10 PAPERS • 1 BENCHMARK
The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. It includes 300 literary essays and over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have room to grow, however.
7 PAPERS • 9 BENCHMARKS
OpenAssistant Conversations is a human-feedback dataset released under the Apache-2.0 license, containing between 100K and 1M samples.
14 PAPERS • NO BENCHMARKS YET
PDFVQA: A New Dataset for Real-World VQA on PDF Documents
3 PAPERS • NO BENCHMARKS YET
In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an online reference for bibliographic information on major computer science publications that indexes over 4.4 million publications by more than 2.2 million authors. Our dataset consists of 10,000 question-answer pairs with the corresponding SPARQL queries, which can be executed over the DBLP KG to fetch the correct answer. To the best of our knowledge, this is the first QA dataset for scholarly KGs.
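A minimal sketch of running one such query with SPARQLWrapper; the endpoint URL, predicate IRI, and query itself are assumptions for illustration, not taken from the dataset:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://sparql.dblp.org/sparql")  # assumed public endpoint
sparql.setQuery("""
    SELECT (COUNT(DISTINCT ?paper) AS ?n) WHERE {
        ?paper <https://dblp.org/rdf/schema#authoredBy> ?author .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results["results"]["bindings"][0]["n"]["value"])  # number of authored papers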
6 PAPERS • NO BENCHMARKS YET
DUDE is formulated as an instance of Document Question Answering (DocQA) to evaluate how well current solutions deal with multi-page documents, if they can navigate and reason over the layout, and if they can generalize these skills to different document types and domains. Since we cannot provide question-answer pairs about, e.g., ticked checkboxes, on each document instance or document type, the challenge presented by DUDE is characterized equally as a Multi-Domain Long-Tailed Recognition problem.
SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it is situated in from a first-person perspective and answer questions. The questions are designed to be situated, embodied, and knowledge-intensive. We offer three different modalities to represent a 3D scene: 3D scan, egocentric video, and BEV picture.
14 PAPERS • 2 BENCHMARKS
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.
2 PAPERS • NO BENCHMARKS YET
AviationQA is introduced in the paper titled "There is No Big Brother or Small Brother: Knowledge Infusion in Language Models for Link Prediction and Question Answering".
1 PAPER • 1 BENCHMARK
KAMEL comprises knowledge about 234 relations from Wikidata with large training, validation, and test splits. We make sure that all facts are also present in Wikipedia, so that they have been seen during the pre-training procedure of the LMs we are probing. Most importantly, we overcome the limitations of existing probing datasets: (1) the dataset has a larger variety of knowledge-graph relations, (2) it contains single- and multi-token entities, (3) it uses relations with literals, (4) it has alternative labels for entities, (5) it provides an evaluation procedure for higher-cardinality relations, which was missing in previous works, and (6) it can be used for causal LMs.
5 PAPERS • 1 BENCHMARK
Financial Language Understanding Evaluation is a comprehensive open-source suite of benchmarks for the financial domain. It contains benchmarks across 5 NLP tasks in the financial domain as well as common benchmarks used in previous research. The tasks are financial sentiment analysis, news headline classification, named entity recognition, structure boundary detection, and question answering.
ReSQ is a real-world Spatial Question Answering dataset with human-generated questions built on an existing corpus with SpRL annotations. This dataset can be used to evaluate spatial language processing models in realistic situations.
SpaRTUN is a dataset synthesized for transfer learning on spatial question answering (SQA) and spatial role labeling (SpRL).
RoMQA is a benchmark for robust, multi-evidence, and multi-answer question answering (QA). RoMQA contains clusters of questions that are derived from related constraints mined from the Wikidata knowledge graph. The dataset evaluates robustness of QA models to varying constraints by measuring worst-case performance within each question cluster.
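A minimal sketch of that cluster-level robustness measure: take the minimum per-question score within each cluster, then average across clusters. The input layout is an assumption; RoMQA's official scripts may differ:

```python
from collections import defaultdict

def worst_case_by_cluster(scored):
    """scored: iterable of (cluster_id, score) pairs -> worst score per cluster."""
    per_cluster = defaultdict(list)
    for cluster_id, score in scored:
        per_cluster[cluster_id].append(score)
    return {cid: min(vals) for cid, vals in per_cluster.items()}

worst = worst_case_by_cluster([("c1", 0.9), ("c1", 0.4), ("c2", 0.7)])
macro_worst_case = sum(worst.values()) / len(worst)  # (0.4 + 0.7) / 2 = 0.55
```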
CoreSearch is a dataset for Cross-Document Event Coreference Search. It consists of two separate passage collections: (1) a collection of passages containing manually annotated coreferring event mentions, and (2) an annotated collection of distractor passages.
DIOR-RSVG is a large-scale benchmark dataset for visual grounding in remote sensing (RSVG), which aims to localize referred objects in remote sensing (RS) images with the guidance of natural language. The dataset includes image/expression/box triplets for training and evaluating visual grounding models.
The EgoTaskQA benchmark contains 40K balanced question-answer pairs selected from 368K programmatically generated questions over 2K egocentric videos. It provides a single home for the crucial dimensions of task understanding through question answering on real-world egocentric videos.
7 PAPERS • 1 BENCHMARK
Mintaka is a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. It is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers.
10 PAPERS • NO BENCHMARKS YET
Science Question Answering (ScienceQA) is a new benchmark that consists of 21,208 multimodal multiple choice questions with diverse science topics and annotations of their answers with corresponding lectures and explanations. Out of the questions in ScienceQA, 10,332 (48.7%) have an image context, 10,220 (48.2%) have a text context, and 6,532 (30.8%) have both. Most questions are annotated with grounded lectures (83.9%) and detailed explanations (90.5%). The lecture and explanation provide general external knowledge and specific reasons, respectively, for arriving at the correct answer. To the best of our knowledge, ScienceQA is the first large-scale multimodal dataset that annotates lectures and explanations for the answers.
141 PAPERS • 1 BENCHMARK
Contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from the community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media.
Phrase in Context (PiC) is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR), and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1,000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC-BY-NC 4.0.
RadQA is a radiology question answering dataset with 3,074 questions posed against radiology reports, annotated by physicians with their corresponding answer spans (a total of 6,148 question-answer-evidence pairs). The questions are created manually from the clinical referral section of the reports, which captures the actual information needs of ordering physicians and eliminates bias from seeing the answer context (and, further, organically creates unanswerable questions). The answer spans are marked within the Findings and Impressions sections of a report. The dataset aims to satisfy complex clinical requirements by including complete (yet concise) answer phrases (not just entities) that can span multiple lines.
QAMPARI is an ODQA benchmark where answers are lists of entities spread across many paragraphs. It was created by (a) generating questions with multiple answers from Wikipedia's knowledge graph and tables, (b) automatically pairing answers with supporting evidence in Wikipedia paragraphs, and (c) manually paraphrasing questions and validating each answer.
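For multi-answer benchmarks like this, a natural scoring choice is set-based precision/recall/F1 over the predicted and gold entity lists; the sketch below illustrates that idea and is not QAMPARI's official evaluation script:

```python
def list_answer_f1(predicted, gold):
    """Set-based F1 between a predicted and a gold list of answer entities."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return float(pred_set == gold_set)  # both empty -> 1.0, one empty -> 0.0
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```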
11 PAPERS • NO BENCHMARKS YET
Contains over 70,000 question-answer pairs from both structured tables and unstructured notes from a publicly available Electronic Health Record (EHR).
The large-scale MUSIC-AVQA dataset of musical performances contains 45,867 question-answer pairs distributed over 9,288 videos totaling more than 150 hours. The QA pairs are divided into 3 modal scenarios, comprising 9 question types and 33 question templates. Finally, since the AVQA task is open-ended, the 42 kinds of answers constitute a candidate set for selection.
22 PAPERS • 1 BENCHMARK
The ability to recognize analogies is fundamental to human cognition. Existing benchmarks that test word analogy do not reveal the underlying process of analogical reasoning in neural models.
8 PAPERS • NO BENCHMARKS YET
ConcurrentQA is a textual multi-hop QA benchmark that requires concurrent retrieval over multiple data distributions (i.e., Wikipedia and email data). The dataset follows the exact same schema and design as HotpotQA. It is downloadable at https://github.com/facebookresearch/concurrentqa, which also contains model and result-analysis code. The benchmark can also be used to study privacy when reasoning over data distributed across multiple privacy scopes, i.e., Wikipedia in the public domain and emails in the private domain.
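A minimal loading sketch, assuming the HotpotQA-style JSON schema the description mentions; the file name and field access are assumptions about the release format:

```python
import json

with open("concurrentqa_dev.json") as f:  # hypothetical file name
    examples = json.load(f)

for ex in examples[:3]:
    print(ex["question"], "->", ex["answer"])
    # HotpotQA-style supporting facts are (title, sentence_index) pairs.
    for title, sent_idx in ex.get("supporting_facts", []):
        print("  evidence:", title, sent_idx)
```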
2 PAPERS • 1 BENCHMARK
A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts
8 PAPERS • 1 BENCHMARK
MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks where the inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types, including translation, summarization, question answering, and classification. Additionally, there is a range of output lengths, from a single-word classification label all the way up to an output longer than the input text.
3 PAPERS • 6 BENCHMARKS
Pirá is a crowdsourced question answering (QA) dataset about the ocean and the Brazilian coast, in both Portuguese and English, designed for reading comprehension.
QALD-9-Plus is a dataset for Knowledge Graph Question Answering (KGQA) based on the well-known QALD-9.
SCROLLS (Standardized CompaRison Over Long Language Sequences) is an NLP benchmark consisting of a suite of tasks that require reasoning over long texts. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. The datasets are made available in a unified text-to-text format, and a live leaderboard is hosted to facilitate research on model architecture and pretraining methods.
31 PAPERS • 1 BENCHMARK
We present the AWS documentation corpus, an open-book QA dataset, which contains 25,175 documents along with 100 matched questions and answers. These questions are inspired by the authors' interactions with real AWS customers and the questions they asked about AWS services. The data was anonymized and aggregated. All questions in the dataset have a valid, factual, and unambiguous answer within the accompanying documents; we deliberately avoided questions that are ambiguous, incomprehensible, opinion-seeking, or not clearly a request for factual information. All questions, answers, and accompanying documents in the dataset are annotated by the authors. There are two types of answers: text and yes-no-none (YNN) answers. Text answers range from a few words to a full paragraph, sourced from a continuous block of words in a document or from different locations within the same document. Every question in the dataset has a matched text answer. Yes-no-none (YNN) answers can be yes, no, or none depending on the question.
Multitask learning has led to significant advances in Natural Language Processing, including the decaNLP benchmark, where question answering is used to frame 10 natural language understanding tasks in a single model. PQ-decaNLP is a crowd-sourced corpus of paraphrased questions, annotated with paraphrase phenomena. This enables analysis of how transformations such as swapping class labels and changing sentence modality lead to large performance degradation.