The Standardized Project Gutenberg Corpus (SPGC) is an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3×10⁹ word-tokens.
5 PAPERS • NO BENCHMARKS YET
The Bach Doodle Dataset is composed of 21.6 million harmonizations submitted from the Bach Doodle. The dataset contains metadata about each composition (such as the country of origin and user feedback), as well as a MIDI file of the user-entered melody and a MIDI file of the generated harmonization. The dataset spans about 6 years of user-entered music.
4 PAPERS • NO BENCHMARKS YET
CQADupStack is a benchmark dataset for community question-answering research. It contains threads from twelve StackExchange subforums, annotated with duplicate question information. Pre-defined training and test splits are provided, both for retrieval and classification experiments, to ensure maximum comparability between different studies using the set. Furthermore, it comes with a script to manipulate the data in various ways.
4 PAPERS • 2 BENCHMARKS
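CQADupStack's duplicate annotations and fixed splits can be pictured with a minimal sketch; the field names and split logic below are illustrative stand-ins, not the dataset's actual schema or its bundled script.

```python
from dataclasses import dataclass

@dataclass
class QuestionPair:
    """One labelled example: a query question, a candidate question,
    and whether they are duplicates (illustrative schema)."""
    query_id: str
    candidate_id: str
    is_duplicate: bool

def split_pairs(pairs, test_query_ids):
    """Partition labelled pairs by a pre-defined set of held-out
    query IDs, mirroring the idea of fixed train/test splits."""
    train = [p for p in pairs if p.query_id not in test_query_ids]
    test = [p for p in pairs if p.query_id in test_query_ids]
    return train, test

pairs = [
    QuestionPair("q1", "q7", True),
    QuestionPair("q2", "q9", False),
    QuestionPair("q3", "q1", True),
]
train, test = split_pairs(pairs, test_query_ids={"q3"})
print(len(train), len(test))  # 2 1
```

Because the splits are pre-defined, every study that partitions the pairs this way evaluates on the same held-out queries, which is what makes results comparable across papers.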
The GoodSounds dataset contains around 28 hours of recordings of single notes and scales played by 15 different professional musicians, all of them holding a music degree and having some expertise in teaching. 12 different instruments (flute, cello, clarinet, trumpet, violin, alto sax, tenor sax, baritone sax, soprano sax, oboe, piccolo and bass) were recorded using between one and four different microphones. For each instrument, the whole set of playable semitones on the instrument was recorded several times with different tonal characteristics. Each note is recorded in a separate monophonic audio file at 48 kHz / 32 bit. Rich annotations of the recordings are available, including details on the recording environment and ratings of the tonal qualities of the sound (“good-sound”, “bad”, “scale-good”, “scale-bad”).
WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural language sentences that allow large-scale training of (translation-based) ranking models.
This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.
4 PAPERS • 1 BENCHMARK
A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.
3 PAPERS • NO BENCHMARKS YET
The DAWT dataset consists of Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of each entity. The dataset contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention-entity co-occurrences. DAWT contains 4.8 times more anchor text to entity links than originally present in the Wikipedia markup. Moreover, it spans several languages including English, Spanish, Italian, German, French and Arabic.
The dataset contains the main components of the news articles published online by the newspaper <a href="https://gazzettadimodena.gelocal.it/modena">Gazzetta di Modena</a>: the url of the web page, title, subtitle, text, date of publication, and the crime category assigned to each news article by the author.
GOAL is a novel dataset of football (or 'soccer') highlight videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource for investigating dynamic language grounding.
Grep-BiasIR is a novel, thoroughly audited dataset that aims to facilitate the study of gender bias in the results retrieved by IR systems.
MuMu is a new dataset of more than 31k albums classified into 250 genre classes.
NFCorpus is a full-text English retrieval data set for Medical Information Retrieval. It contains a total of 3,244 natural language queries (written in non-technical English, harvested from the NutritionFacts.org site) with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in a complex terminology-heavy language), mostly from PubMed.
3 PAPERS • 1 BENCHMARK
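Relevance judgments of the kind NFCorpus provides map each query to a set of relevant document IDs; a standard retrieval metric such as precision@k can then be computed against a ranked run. A minimal sketch with toy IDs (not actual NFCorpus data):

```python
def precision_at_k(ranked_docs, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked_docs[:k]
    hits = sum(1 for d in top if d in relevant)
    return hits / k

# Toy qrels (query -> relevant doc IDs) and a toy ranked run.
qrels = {"q1": {"d3", "d7"}}
run = {"q1": ["d7", "d1", "d3", "d9"]}

p = precision_at_k(run["q1"], qrels["q1"], k=4)
print(p)  # 0.5
```

The same qrels structure applies to any of the retrieval collections listed here, which is why automatically extracted judgments at this scale are directly usable for training and evaluating ranking models.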
A corpus of 553k news articles from six Persian news websites and agencies with relatively high-quality, author-extracted keyphrases, which are then filtered and cleaned to achieve higher-quality keyphrases.
SV-Ident comprises 4,248 sentences from social science publications in English and German. The data is the official data for the Shared Task: “Survey Variable Identification in Social Science Publications” (SV-Ident) 2022. Sentences are labeled with variables that are mentioned either explicitly or implicitly.
3 PAPERS • 2 BENCHMARKS
BoostCLIR is a bilingual (Japanese-English) corpus of patent abstracts, extracted from the MAREC patent data and the NTCIR PatentMT workshop collections, accompanied by relevance judgments for the task of patent prior-art search.
2 PAPERS • NO BENCHMARKS YET
COSIAN is an annotation collection of Japanese popular (J-POP) songs, focusing on the singing style and expression of famous solo singers.
HiREST (HIerarchical REtrieval and STep-captioning) is a benchmark covering hierarchical information retrieval and visual/textual stepwise summarization over an instructional video corpus. It consists of 3.4K text-video pairs, of which 1.1K videos are annotated with the moment spans relevant to the text query and a breakdown of each moment into key instruction steps with captions and timestamps (totaling 8.6K step captions). The dataset supports four tasks: video retrieval, moment retrieval, and the two novel tasks of moment segmentation and step captioning.
The Large-Scale CLIR Dataset is a retrieval dataset built for Cross-Language Information Retrieval (CLIR). The dataset is derived from Wikipedia and contains more than 2.8 million English single-sentence queries with relevant documents from 25 other selected languages.
A new large-scale retail product dataset for fine-grained image classification. Unlike previous datasets focusing on relatively few products, more than 500,000 images of retail products on shelves were collected, belonging to 2000 different products. The dataset aims to advance the research in retail object recognition, which has massive applications such as automatic shelf auditing and image-based product information retrieval.
ReSQ is a real-world Spatial Question Answering dataset with human-generated questions built on an existing corpus with SpRL annotations. This dataset can be used to evaluate spatial language processing models in realistic situations.
The TREC News Track features modern search tasks in the news domain. In partnership with The Washington Post, we are developing test collections that support the search needs of news readers and news writers in the current news environment. It's our hope that the track will foster research that establishes a new sense for what "relevance" means for news search.
2 PAPERS • 1 BENCHMARK
This dataset is used for the task of conversational document prediction. The dataset includes conversations that occurred between users and customer care agents in 25 organizations on the Twitter platform. Each conversation ends with a customer care agent providing a URL to a document to resolve the issue the user is facing. The task is to predict the document given a dialog context. The train, dev, and test splits include 10,000, 525, and 500 conversations, respectively.
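The conversational document prediction task above reduces to matching a predicted URL against the gold URL for each dialog; a toy accuracy computation follows, with dialog IDs and URLs invented for illustration.

```python
def document_accuracy(predictions, gold):
    """Fraction of dialogs whose predicted document URL exactly
    matches the gold URL provided by the customer care agent."""
    correct = sum(1 for dlg_id, url in gold.items()
                  if predictions.get(dlg_id) == url)
    return correct / len(gold)

# Invented example records: dialog ID -> resolving document URL.
gold = {"dlg1": "https://example.com/faq/reset",
        "dlg2": "https://example.com/faq/billing"}
preds = {"dlg1": "https://example.com/faq/reset",
         "dlg2": "https://example.com/faq/refund"}

print(document_accuracy(preds, gold))  # 0.5
```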
Touché is a shared task on argument retrieval; its second year was held at CLEF 2021. With the goal of providing a collaborative platform for researchers, two tasks were organized: (1) supporting individuals in finding arguments on controversial topics of social importance and (2) supporting individuals with arguments in personal everyday comparison situations.
A set of 248 search queries annotated with the correct diagnosis. The diagnosis is referenced with a Concept Unique Identifier (CUI). In a retrieval setting, the task consists of retrieving an article from the FindZebra corpus with a CUI that matches the query CUI.
1 PAPER • NO BENCHMARKS YET
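The FindZebra-style evaluation described above counts a query as solved when a top-ranked retrieved article carries the same CUI as the query. A minimal sketch (the CUI values below are placeholders, not taken from the dataset):

```python
def solved_at_k(retrieved_cuis, query_cui, k=20):
    """True if any of the top-k retrieved articles is annotated with
    the same Concept Unique Identifier as the query."""
    return query_cui in retrieved_cuis[:k]

# Placeholder CUIs for three retrieved articles, best-ranked first.
retrieved = ["C0000001", "C0000002", "C0000003"]

print(solved_at_k(retrieved, "C0000003", k=3))  # True
print(solved_at_k(retrieved, "C0000003", k=2))  # False
```

Matching on CUIs rather than article IDs means any article about the correct diagnosis counts as a hit, which suits a rare-disease retrieval setting.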
This data adds textual meta-information to two existing corpora for cross-language information retrieval: BoostCLIR and the Large-Scale CLIR Dataset (wiki-clir).
A labelled version of the ORCAS click-based dataset of Web queries, which provides 18 million connections to 10 million distinct queries.
1 PAPER • 1 BENCHMARK
The Persian Reverse Dictionary Dataset is a collection of 855,217 words along with the phrases describing them. The phrases were extracted from the three most well-known Persian dictionaries (Amid, Moeen, and Dehkhoda), Persian Wikipedia, and a Persian wordnet (called FarsNet).
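The reverse-dictionary task behind this dataset maps a defining phrase back to candidate words. A naive token-overlap ranker over toy English stand-in entries illustrates the setup (this is not the dataset's actual method or data):

```python
def rank_words(query, entries):
    """Rank words by how many tokens their stored description
    shares with the query phrase; drop zero-overlap words."""
    q = set(query.lower().split())
    scored = [(len(q & set(desc.lower().split())), word)
              for word, desc in entries.items()]
    scored.sort(reverse=True)
    return [w for s, w in scored if s > 0]

# Toy stand-in entries: word -> defining phrase.
entries = {
    "telescope": "instrument for viewing distant objects",
    "microscope": "instrument for viewing very small objects",
    "piano": "large keyboard musical instrument",
}
print(rank_words("device for viewing distant objects", entries))
# ['telescope', 'microscope']
```

Real reverse-dictionary models replace the token overlap with learned embeddings of the defining phrase, but the input/output shape of the task is the same.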
Phrase in Context (PiC) is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR) and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC BY-NC 4.0.
Urdu News Headlines Dataset (VOA and BBC): a collection of news headlines in the Urdu language, scraped from news websites and social media platforms. Such datasets are valuable for researchers and developers working on a variety of natural language processing tasks.
The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.
WikiPII is an automatically labeled dataset composed of Wikipedia biography pages, annotated for personal information extraction.
MedleyDB 2.0 is a superset of MedleyDB, a dataset of annotated, royalty-free multitrack recordings. The second iteration of the dataset includes 74 new multitrack recordings, resulting in 194 songs in total.
0 PAPERS • NO BENCHMARKS YET
The peer-reviewed publication for this dataset was presented at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) and can be accessed at https://arxiv.org/abs/2205.02596. Please cite this paper when using the dataset.