82 dataset results for Information Retrieval

HiREST (HIerarchical REtrieval and STep-captioning)

The HiREST (HIerarchical REtrieval and STep-captioning) dataset is a benchmark that covers hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. It consists of 3.4K text-video pairs from a video dataset, where 1.1K videos have annotations of moment spans relevant to the text query and a breakdown of each moment into key instruction steps with captions and timestamps (totaling 8.6K step captions). The benchmark comprises four tasks: video retrieval, moment retrieval, and two novel tasks, moment segmentation and step captioning.

2 PAPERS • NO BENCHMARKS YET

ClueWeb22

ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages accompanied by rich information. Its design was influenced by the need for a high-quality, large-scale web corpus to support a range of academic and industry research, for example, in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, the ClueWeb22 corpus is larger, more varied, of higher quality, and aligned with the document distributions in commercial web search. Besides raw HTML, the dataset includes rich information about the web pages provided by industry-standard document understanding systems, including the visual representation of pages rendered by a web browser, parsed HTML structure information from a neural network parser, and pre-processed cleaned document text.

5 PAPERS • NO BENCHMARKS YET

SciRepEval

SciRepEval is a comprehensive benchmark for training and evaluating scientific document representations. It includes 25 challenging and realistic tasks, 11 of which are new, across four formats: classification, regression, ranking and search.

5 PAPERS • NO BENCHMARKS YET

Goal

Goal is a novel dataset of football (or 'soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding.

3 PAPERS • NO BENCHMARKS YET

COSIAN (a collection of singing voice annotation)

COSIAN is an annotation collection of Japanese popular (J-POP) songs, focusing on the singing style and expression of famous solo singers.

2 PAPERS • NO BENCHMARKS YET

ReSQ (Real-world Spatial Question Answering)

ReSQ is a real-world Spatial Question Answering dataset with human-generated questions built on an existing corpus with SpRL annotations. This dataset can be used to evaluate spatial language processing models in realistic situations.

2 PAPERS • NO BENCHMARKS YET

MTEB (Massive Text Embedding Benchmark)

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 task types are Bitext mining, Classification, Clustering, Pair Classification, Reranking, Retrieval, Semantic Textual Similarity and Summarisation. The 56 datasets contain varying text lengths and they are grouped into three categories: Sentence to sentence, Paragraph to paragraph, and Sentence to paragraph.

51 PAPERS • 8 BENCHMARKS

FZ queries (FindZebra queries)

A set of 248 search queries annotated with the correct diagnosis. The diagnosis is referenced with a Concept Unique Identifier (CUI). In a retrieval setting, the task consists of retrieving an article from the FindZebra corpus with a CUI that matches the query CUI.
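The CUI-matching setup described above could be scored roughly as follows. This is a minimal sketch, not the dataset's official evaluation: the metric choices (hit@k and mean reciprocal rank), the function names, and the CUI strings are all assumptions made for illustration, and each query is assumed to come with a ranked list of the CUIs of the retrieved articles.

```python
from typing import Optional, Sequence, Tuple


def first_match_rank(query_cui: str, ranked_article_cuis: Sequence[str]) -> Optional[int]:
    """Return the 1-based rank of the first article whose CUI equals the query CUI."""
    for rank, article_cui in enumerate(ranked_article_cuis, start=1):
        if article_cui == query_cui:
            return rank
    return None  # no retrieved article carries the annotated diagnosis


def hit_at_k(query_cui: str, ranked_article_cuis: Sequence[str], k: int) -> bool:
    """True if an article with the matching CUI appears in the top k results."""
    rank = first_match_rank(query_cui, ranked_article_cuis)
    return rank is not None and rank <= k


def mean_reciprocal_rank(runs: Sequence[Tuple[str, Sequence[str]]]) -> float:
    """Average of 1/rank over all queries; a query with no match contributes 0."""
    if not runs:
        return 0.0
    total = 0.0
    for query_cui, ranked in runs:
        rank = first_match_rank(query_cui, ranked)
        if rank is not None:
            total += 1.0 / rank
    return total / len(runs)


# Hypothetical CUIs, made up for this example (not taken from the dataset).
example_runs = [
    ("C0000001", ["C0000002", "C0000001", "C0000003"]),  # match at rank 2
    ("C0000004", ["C0000005", "C0000006"]),              # no match
]
print(hit_at_k(*example_runs[0], k=1))
print(mean_reciprocal_rank(example_runs))
```

Comparing CUIs rather than article identifiers means that any article describing the correct diagnosis counts as a hit, which matches the task as described.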

1 PAPER • NO BENCHMARKS YET

SV-Ident (Survey Variable Identification)

SV-Ident comprises 4,248 sentences from social science publications in English and German. It is the official dataset for the shared task “Survey Variable Identification in Social Science Publications” (SV-Ident) 2022. Sentences are labeled with variables that are mentioned either explicitly or implicitly.

3 PAPERS • 2 BENCHMARKS

PANACEA (PANACEA dataset - Heterogeneous COVID-19 Claims)

The peer-reviewed publication for this dataset was presented at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite it when using the dataset.

0 PAPER • NO BENCHMARKS YET

Phrase-in-Context

Phrase-in-Context (PiC) is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR) and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC-BY-NC 4.0.

1 PAPER • NO BENCHMARKS YET

ORCAS-I (Queries Annotated with Intent using Weak Supervision)

A labelled version of the ORCAS click-based dataset of Web queries, which provides 18 million connections to 10 million distinct queries.

1 PAPER • 1 BENCHMARK

WANDS (Wayfair ANnotation Dataset)

The dataset contains:

7 PAPERS • NO BENCHMARKS YET

Grep-BiasIR (Gender Representation-Bias for Information Retrieval)

Grep-BiasIR is a novel, thoroughly audited dataset that aims to facilitate the study of gender bias in the retrieved results of IR systems.

3 PAPERS • NO BENCHMARKS YET

BSARD (Belgian Statutory Article Retrieval Dataset)

The Belgian Statutory Article Retrieval Dataset (BSARD) is a French native corpus for studying statutory article retrieval. BSARD consists of more than 22,600 statutory articles from Belgian law and about 1,100 legal questions posed by Belgian citizens and labeled by experienced jurists with relevant articles from the corpus.

6 PAPERS • 1 BENCHMARK

WikiPII

WikiPII is an automatically labeled dataset composed of Wikipedia biography pages, annotated for personal information extraction.

1 PAPER • NO BENCHMARKS YET

Persian Reverse Dictionary Dataset

The Persian Reverse Dictionary Dataset is a collection of 855,217 words along with the phrases describing them. The phrases were extracted from three of the best-known Persian dictionaries (Amid, Moeen, and Dehkhoda), Persian Wikipedia, and a Persian WordNet (called FarsNet).

1 PAPER • NO BENCHMARKS YET

TripClick

TripClick is a large-scale dataset of click logs in the health domain, obtained from user interactions of the Trip Database health web search engine.

15 PAPERS • NO BENCHMARKS YET

MetaCLIR

This dataset adds textual meta-information to two existing corpora for cross-language information retrieval: BoostCLIR and the Large-Scale CLIR Dataset (wiki-clir).

1 PAPER • NO BENCHMARKS YET

TREC-COVID

TREC-COVID is a community evaluation designed to build a test collection that captures the information needs of biomedical researchers using the scientific literature during a pandemic. One of the key characteristics of pandemic search is the accelerated rate of change: the topics of interest evolve as the pandemic progresses and the scientific literature in the area explodes. The COVID-19 pandemic provides an opportunity to capture this progression as it happens. TREC-COVID, in creating a test collection around COVID-19 literature, is building infrastructure to support new research and technologies in pandemic search.

64 PAPERS • 1 BENCHMARK

CORD-19

CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.

157 PAPERS • 2 BENCHMARKS

QASC (Question Answering via Sentence Composition)

QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences.

99 PAPERS • NO BENCHMARKS YET

JuICe (JuICe Dataset)

JuICe is a corpus of 1.5 million examples with a curated test set of 3.7K instances based on online programming assignments. Compared with existing contextual code generation datasets, JuICe provides refined human-curated data, open-domain code, and an order of magnitude more training data.

13 PAPERS • NO BENCHMARKS YET

CCPE-M (Coached Conversational Preference Elicitation dataset for Movies)

A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.

3 PAPERS • NO BENCHMARKS YET

ReQA (Retrieval Question-Answering)

The Retrieval Question-Answering (ReQA) benchmark tests a model’s ability to retrieve relevant answers efficiently from a large set of documents.

10 PAPERS • NO BENCHMARKS YET

Bach Doodle

The Bach Doodle Dataset is composed of 21.6 million harmonizations submitted from the Bach Doodle. The dataset contains metadata about each composition (such as the country of origin and feedback), as well as a MIDI of the user-entered melody and a MIDI of the generated harmonization. The dataset contains about 6 years of user-entered music.

4 PAPERS • NO BENCHMARKS YET

MSSD (Music Streaming Sessions Dataset)

The Spotify Music Streaming Sessions Dataset (MSSD) consists of 160 million streaming sessions with associated user interactions, audio features and metadata describing the tracks streamed during the sessions, and snapshots of the playlists listened to during the sessions.

5 PAPERS • 1 BENCHMARK

Large-Scale CLIR Dataset

The Large-Scale CLIR Dataset is a retrieval dataset built for Cross-Language Information Retrieval (CLIR). The dataset is derived from Wikipedia and contains more than 2.8 million English single-sentence queries with relevant documents from 25 other selected languages.

2 PAPERS • NO BENCHMARKS YET

GuitarSet

GuitarSet is a dataset of high-quality guitar recordings and rich annotations. It contains 360 excerpts 30 seconds in length. The 360 excerpts are the result of the following combinations:

24 PAPERS • NO BENCHMARKS YET

OpenMIC-2018

OpenMIC-2018 is an instrument recognition dataset containing 20,000 examples of Creative Commons-licensed music available on the Free Music Archive. Each example is a 10-second excerpt which has been partially labeled for the presence or absence of 20 instrument classes by annotators on a crowd-sourcing platform.

7 PAPERS • 1 BENCHMARK

MuMu

MuMu is a new dataset of more than 31k albums classified into 250 genre classes.

4 PAPERS • NO BENCHMARKS YET

FMA (Free Music Archive)

The Free Music Archive (FMA) is a large-scale dataset for evaluating several tasks in Music Information Retrieval. It consists of 343 days of audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies.

95 PAPERS • 2 BENCHMARKS

QUASAR-S (QUestion Answering by Search And Reading – Stack Overflow)

QUASAR-S is a large-scale dataset aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. It consists of 37,362 cloze-style (fill-in-the-gap) queries constructed from definitions of software entity tags on the popular website Stack Overflow. The posts and comments on the website serve as the background corpus for answering the cloze questions. The answer to each question is restricted to be another software entity, from an output vocabulary of 4,874 entities.

8 PAPERS • NO BENCHMARKS YET

WikiReading

WikiReading is a large-scale natural language understanding task and publicly-available dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. The task contains a rich variety of challenging classification and extraction sub-tasks, making it well-suited for end-to-end models such as deep neural networks (DNNs).

26 PAPERS • NO BENCHMARKS YET

MS MARCO (Microsoft Machine Reading Comprehension Dataset)

MS MARCO (Microsoft MAchine Reading COmprehension) is a collection of datasets focused on deep learning in search. The first dataset was a question answering dataset featuring 100,000 real Bing questions, each with a human-generated answer. Over time the collection was extended with a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset.

823 PAPERS • 7 BENCHMARKS

i2b2 De-identification Dataset (Informatics for Integrating Biology and the Bedside (i2b2) Project — De-identification Dataset)

This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.

4 PAPERS • 1 BENCHMARK

Urdu News Headlines Dataset

Urdu News Headlines Dataset with VOA and BBC. An Urdu news headlines dataset is a collection of news headlines in the Urdu language, typically scraped from news websites and social media platforms. These datasets can be valuable for researchers and developers working on a variety of tasks, such as:

1 PAPER • 1 BENCHMARK

BioASQ (Biomedical Semantic Indexing and Question Answering)

BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C) (also called snippets).

163 PAPERS • 2 BENCHMARKS

iKala

The iKala dataset is a singing voice separation dataset that comprises 252 30-second excerpts sampled from 206 iKala songs (plus 100 hidden excerpts reserved for the MIREX data mining contest). The music accompaniment and the singing voice are recorded in the left and right channels, respectively. Additionally, human-labeled pitch contours and timestamped lyrics are provided.
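Because the accompaniment and the voice sit in separate channels, reference sources for separation experiments can be obtained by splitting each stereo excerpt. The sketch below illustrates the idea on plain Python lists of (left, right) sample pairs. The function names and the averaging downmix are assumptions made for illustration, not part of any official iKala tooling, and decoding the audio files is left out.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[float, float]  # (left sample, right sample)


def split_channels(frames: Sequence[Frame]) -> Tuple[List[float], List[float]]:
    """Split stereo frames into (accompaniment, voice).

    Follows the layout described above: left channel = music accompaniment,
    right channel = singing voice.
    """
    accompaniment = [left for left, _ in frames]
    voice = [right for _, right in frames]
    return accompaniment, voice


def downmix(frames: Sequence[Frame]) -> List[float]:
    """Average both channels to get a mono mixture (an assumed mixing rule)."""
    return [(left + right) / 2 for left, right in frames]


# Tiny synthetic excerpt, made up for this example.
example_frames = [(0.25, 0.5), (0.5, -0.5)]
accompaniment, voice = split_channels(example_frames)
mixture = downmix(example_frames)
print(accompaniment, voice, mixture)
```

A separation model would then take `mixture` as input and be compared against `voice` and `accompaniment` as references.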

20 PAPERS • 1 BENCHMARK

WikiCLIR

WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural language sentences that allow large-scale training of (translation-based) ranking models.

4 PAPERS • NO BENCHMARKS YET

MedleyDB

MedleyDB is a dataset of annotated, royalty-free multitrack recordings. It was curated primarily to support research on melody extraction. For each song, melody f₀ annotations are provided, as well as instrument activations for evaluating automatic instrument recognition. The original dataset consists of 122 multitrack songs, of which 108 include melody annotations.

41 PAPERS • NO BENCHMARKS YET

WMT 2014 Medical (WMT 2014 Medical Translation Task)

The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.

1 PAPER • NO BENCHMARKS YET

BoostCLIR

BoostCLIR is a bilingual (Japanese-English) corpus of patent abstracts, extracted from the MAREC patent data, and the data from the NTCIR PatentMT workshop collections, accompanied with relevance judgements for the task of patent prior-art search.

2 PAPERS • NO BENCHMARKS YET

MSLR-WEB30K

The MSLR-WEB30K dataset consists of 30,000 search queries over the documents from search results. The data also contains the values of 136 features and a corresponding user-labeled relevance factor on a scale of one to five with respect to each query-document pair.

31 PAPERS • 1 BENCHMARK

MQ2008

The MQ2008 dataset is a dataset for Learning to Rank. It contains 800 queries with labelled documents.

27 PAPERS • NO BENCHMARKS YET

MSLR-WEB10K

The MSLR-WEB10K dataset consists of 10,000 search queries over the documents from search results. The data also contains the values of 136 features and a corresponding user-labeled relevance factor on a scale of one to five with respect to each query-document pair. It is a subset of the MSLR-WEB30K dataset.

35 PAPERS • NO BENCHMARKS YET

George Washington

The George Washington dataset contains 20 pages of letters written by George Washington and his associates in 1755, and is therefore categorized as a historical collection. The images are annotated at word level and contain approximately 5,000 words.

19 PAPERS • NO BENCHMARKS YET

Learning to Rank Challenge (Yahoo! Learning to Rank Challenge)

The Yahoo! Learning to Rank Challenge dataset consists of 709,877 documents encoded in 700 features and sampled from query logs of the Yahoo! search engine, spanning 29,921 queries.

24 PAPERS • NO BENCHMARKS YET
