🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

25 dataset results for Retrieval

OK-VQA (Outside Knowledge Visual Question Answering)

Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer.

259 PAPERS • 2 BENCHMARKS

Polyvore (Polyvore Outfits)

This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing.

56 PAPERS • 3 BENCHMARKS

PopQA

PopQA is an open-domain QA dataset with 14k QA pairs with fine-grained Wikidata entity ID, Wikipedia page views, and relationship type information.

22 PAPERS • NO BENCHMARKS YET

ARO

Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order information. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO-Order & Flickr30k-Order, to test for order sensitivity in VLMs. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases.

21 PAPERS • NO BENCHMARKS YET

ALCE (Automatic LLMs' Citation Evaluation)

ALCE is a benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations.

18 PAPERS • NO BENCHMARKS YET

InfoSeek (Visual Information Seeking)

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits models to use fine-grained knowledge that was learned during their pre-training.

17 PAPERS • 2 BENCHMARKS

QAMPARI

QAMPARI is an ODQA benchmark, where question answers are lists of entities, spread across many paragraphs. It was created by (a) generating questions with multiple answers from Wikipedia's knowledge graph and tables, (b) automatically pairing answers with supporting evidence in Wikipedia paragraphs, and (c) manually paraphrasing questions and validating each answer.

11 PAPERS • NO BENCHMARKS YET

DIOR-RSVG

DIOR-RSVG is a large-scale benchmark dataset of remote sensing data (RSVG). It aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models.

7 PAPERS • NO BENCHMARKS YET

MVK (Marine Video Kit)

The dataset contains single-shot videos taken from moving cameras in underwater environments. The first shard of a new Marine Video Kit dataset is presented to serve for video retrieval and other computer vision challenges. In addition to basic meta-data statistics, we present several insights based on low-level features as well as semantic annotations of selected keyframes. 1379 videos with a length from 2 s to 4.95 min, with the mean and median duration of each video is 29.9 s, and 25.4 s, respectively. We capture data from 11 diﬀerent regions and countries during the time from 2011 to 2022.

7 PAPERS • 1 BENCHMARK

BSARD (Belgian Statutory Article Retrieval Dataset)

The Belgian Statutory Article Retrieval Dataset (BSARD) is a French native corpus for studying statutory article retrieval. BSARD consists of more than 22,600 statutory articles from Belgian law and about 1,100 legal questions posed by Belgian citizens and labeled by experienced jurists with relevant articles from the corpus.

6 PAPERS • 1 BENCHMARK

xCodeEval

xCodeEval is one of the largest executable multilingual multitask benchmarks consisting of 17 programming languages with execution-level parallelism. It features a total of seven tasks involving code understanding, generation, translation, and retrieval, and it employs an execution-based evaluation instead of traditional lexical approaches. It also provides a test-case-based multilingual code execution engine, ExecEval that supports all the programming languages in xCodeEval.

6 PAPERS • NO BENCHMARKS YET

ComFact

ComFact is a benchmark for commonsense fact linking, where models are given contexts and trained to identify situationally-relevant commonsense knowledge from KGs. The novel benchmark, C-om-Fact, contains ∼293k in-context relevance annotations for common-sense triplets across four stylistically diverse dialogue and storytelling datasets.

5 PAPERS • NO BENCHMARKS YET

ProofNet

ProofNet is a benchmark for autoformalization and formal proving of undergraduate-level mathematics. The ProofNet benchmarks consists of 371 examples, each consisting of a formal theorem statement in Lean 3, a natural language theorem statement, and natural language proof. The problems are primarily drawn from popular undergraduate pure mathematics textbooks and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology.

5 PAPERS • NO BENCHMARKS YET

DiSCQ (Discharge Summary Clinical Questions)

DiSCQ is a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. This dataset is released to facilitate further research into realistic clinical Question Answering (QA) and Question Generation (QG).

4 PAPERS • NO BENCHMARKS YET

COSIAN (a collection of singing voice annotation)

COSIAN is an annotation collection of Japanese popular (J-POP) songs, focusing on singing style and expression of famous solo-singers.

2 PAPERS • NO BENCHMARKS YET

LLeQA (Long-form Legal Question Answering)

LLeQA is a French native dataset for studying information retrieval and long-form question answering in the legal domain. It consists of a knowledge corpus of 27,941 statutory articles collected from the Belgian legislation, and 1,868 legal questions posed by Belgian citizens and labeled by experienced jurists with a comprehensive answer rooted in relevant articles from the corpus.

2 PAPERS • NO BENCHMARKS YET

RoMQA

RoMQA is a benchmark for robust, multi-evidence, and multi-answer question answering (QA). RoMQA contains clusters of questions that are derived from related constraints mined from the Wikidata knowledge graph. The dataset evaluates robustness of QA models to varying constraints by measuring worst-case performance within each question cluster.

2 PAPERS • NO BENCHMARKS YET

ShapeTalk (The ShapeTalk Dataset)

ShapeTalk contains over half a million discriminative utterances produced by contrasting the shapes of common 3D objects for a variety of object classes and degrees of similarity. The dataset provides discriminative utterances for a total of 36,391 shapes, across 30 object classes. Overall, ShapeTalk contains 73,799 distinct contexts, and a total of 536,596 utterances

2 PAPERS • NO BENCHMARKS YET

CoreSearch

CoreSearch is a dataset for Cross-Document Event Coreference Search. It consists of two separate passage collections: (1) a collection of passages containing manually annotated coreferring event mention, and (2) an annotated collection of destructor passages.

1 PAPER • NO BENCHMARKS YET

FanOutQA

FanOutQA is a high quality, multi-hop, multi-document benchmark for large language models using English Wikipedia as its knowledge base. Compared to other question-answering benchmarks, FanOutQA requires reasoning over a greater number of documents, with the benchmark's main focus being on the titular fan-out style of question. We present these questions in three tasks -- closed-book, open-book, and evidence-provided -- which measure different abilities of LLM systems.

1 PAPER • NO BENCHMARKS YET

FewDR

FewDR is a dataset for Few-shot dense retrieval (DR). FewDR aims to effectively generalize to novel search scenarios by learning a few samples. Specifically, FewDR employs class-wise sampling to establish a standardized "few-shot" setting with finely-defined classes, reducing variability in multiple sampling rounds.

1 PAPER • NO BENCHMARKS YET

PTVD

PTVD is a plot-oriented multimodal dataset in the TV domain. It is also the first non-English dataset of its kind. Additionally, PTVD contains more than 26 million bullet screen comments (BSCs), powering large-scale pre-training.

1 PAPER • NO BENCHMARKS YET

PoseScript

PoseScript is a dataset that pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. This dataset is designed for the retrieval of relevant poses from large-scale datasets and synthetic pose generation, both based on a textual pose description.

1 PAPER • NO BENCHMARKS YET

Spiced

Spiced is a paraphrase dataset of scientific findings annotated for degree of information change. Spiced contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers.

1 PAPER • NO BENCHMARKS YET

Statcan Dialogue Dataset

The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents

1 PAPER • 1 BENCHMARK

Datasets

25 dataset results for Retrieval