2894 dataset results for English

A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.

19 PAPERS • NO BENCHMARKS YET

Abt-Buy

The Abt-Buy dataset for entity resolution is derived from the online retailers Abt.com and Buy.com. It contains 1,081 entities from Abt.com and 1,092 entities from Buy.com, as well as a gold standard (perfect mapping) with 1,097 matching record pairs between the two data sources. The attributes common to both data sources are product name, product description, and product price.
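A common way to score an entity-resolution system against a gold standard like this one is pairwise precision, recall, and F1 over matched record pairs. The sketch below assumes each match is represented as a hypothetical `(abt_id, buy_id)` tuple. It is not the official evaluation protocol of any benchmark listed here, and the IDs in the usage example are made up.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: one match = (abt_id, buy_id). The ID format is hypothetical.
Pair = Tuple[str, str]


def pair_metrics(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Pairwise precision, recall, and F1 of predicted matches against a gold mapping."""
    pred_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)
    true_pos = len(pred_set & gold_set)  # predicted pairs that are also in the gold standard
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data: 2 of 3 predictions are correct, and 2 of 4 gold pairs are found.
    gold = [("abt-1", "buy-7"), ("abt-2", "buy-3"), ("abt-3", "buy-9"), ("abt-4", "buy-1")]
    predicted = [("abt-1", "buy-7"), ("abt-2", "buy-3"), ("abt-5", "buy-2")]
    p, r, f = pair_metrics(predicted, gold)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Using sets makes duplicate predictions count once, which keeps precision from being inflated by repeated pairs.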

18 PAPERS • 2 BENCHMARKS

CLOTH (CLOze test by TeacHers)

The Cloze Test by Teachers (CLOTH) benchmark is a collection of nearly 100,000 4-way multiple-choice cloze-style questions from middle- and high-school-level English language exams, where the answer fills a blank in a given text. Each question is labeled with the type of deep reasoning it involves; the four possible types are grammar, short-term reasoning, matching/paraphrasing, and long-term reasoning (i.e., reasoning over multiple sentences).

18 PAPERS • NO BENCHMARKS YET

COMETA

COMETA consists of 20k English biomedical entity mentions from Reddit, expert-annotated with links to SNOMED CT, a widely used medical knowledge graph.

18 PAPERS • NO BENCHMARKS YET

CVSS

CVSS is a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems.

18 PAPERS • 1 BENCHMARK

Chameleon (48%/32%/20% fixed splits)

Node classification on Chameleon with the fixed 48%/32%/20% splits provided by Geom-GCN.

18 PAPERS • 2 BENCHMARKS

CliCR

CliCR is a dataset for domain-specific reading comprehension, consisting of around 100,000 cloze queries constructed from clinical case reports.

18 PAPERS • 1 BENCHMARK

Creative Writing

A creative writing task where the input is 4 random sentences and the output should be a coherent passage with 4 paragraphs that end in the 4 input sentences respectively. Such a task is open-ended and exploratory, and challenges creative thinking as well as high-level planning.

18 PAPERS • NO BENCHMARKS YET

Deezer-Europe

Node classification on Deezer Europe with 50%/25%/25% random splits for training/validation/test.
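Random splits such as the 50%/25%/25% one described above can be generated by shuffling node indices with a fixed seed and slicing the result. The function below is a minimal sketch under that assumption. `random_node_split` is a hypothetical helper, and it does not reproduce the specific split files behind any published results. Calling it with `0.6, 0.2` gives the 60%/20%/20% random splits used by several other entries on this page.

```python
import random
from typing import List, Tuple


def random_node_split(
    num_nodes: int, train_frac: float, val_frac: float, seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle node indices and cut them into train/validation/test index lists.

    The test set receives every node not assigned to train or validation,
    so the three lists always partition range(num_nodes).
    """
    if train_frac < 0.0 or val_frac < 0.0 or train_frac + val_frac > 1.0:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)  # seeded, so the same split can be reproduced
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Hypothetical graph size; a 50%/25%/25% split as in the Deezer-Europe entry.
    train, val, test = random_node_split(1000, 0.5, 0.25, seed=42)
    print(len(train), len(val), len(test))
```

Assigning the remainder to the test set avoids losing nodes to rounding when the fractions do not divide `num_nodes` evenly.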

18 PAPERS • 1 BENCHMARK

FNC-1 (Fake News Challenge Stage 1)

FNC-1 was designed as a stance detection dataset and contains 75,385 labeled headline and article pairs. Each pair is labeled as one of agree, disagree, discuss, or unrelated. Each headline in the dataset is phrased as a statement.

18 PAPERS • 2 BENCHMARKS

Funcom

Funcom is a collection of ~2.1 million Java methods and their associated Javadoc comments. It was derived from a set of 51 million Java methods and includes only methods that have an associated comment written in English, with auto-generated files removed. Each method/comment pair also has an associated method_uid and project_uid, so methods can easily be grouped by their parent project.
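Because every method/comment pair carries a `project_uid`, grouping by project takes a single pass over the records. The sketch below assumes the records are already loaded into memory. `MethodRecord` and its `code` and `comment` fields are hypothetical, and integer IDs are an assumption; only `method_uid` and `project_uid` come from the Funcom description. Grouping this way makes it straightforward, for example, to assign whole projects to a single data split.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class MethodRecord:
    method_uid: int   # identifier named in the dataset description (int type assumed)
    project_uid: int  # identifier named in the dataset description (int type assumed)
    code: str         # hypothetical field: Java method source
    comment: str      # hypothetical field: associated Javadoc comment


def group_by_project(records: Iterable[MethodRecord]) -> Dict[int, List[MethodRecord]]:
    """Bucket method/comment pairs by their parent project's project_uid."""
    groups: Dict[int, List[MethodRecord]] = defaultdict(list)
    for rec in records:
        groups[rec.project_uid].append(rec)  # preserves input order within each project
    return dict(groups)


if __name__ == "__main__":
    records = [
        MethodRecord(1, 10, "int add(int a, int b) { return a + b; }", "Adds two ints."),
        MethodRecord(2, 10, "void reset() { count = 0; }", "Resets the counter."),
        MethodRecord(3, 20, "String name() { return name; }", "Returns the name."),
    ]
    for project, methods in group_by_project(records).items():
        print(project, [m.method_uid for m in methods])
```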

18 PAPERS • NO BENCHMARKS YET

Geometry3K

A new large-scale geometry problem-solving dataset with:
- 3,002 multiple-choice geometry problems
- dense annotations in formal language for the diagrams and text
- 27,213 annotated diagram logic forms (literals)
- 6,293 annotated text logic forms (literals)

18 PAPERS • 1 BENCHMARK

Groningen Meaning Bank

Groningen Meaning Bank is a semantic resource that anyone can edit and that integrates various semantic phenomena, including predicate-argument structure, scope, tense, thematic roles, animacy, pronouns, and rhetorical relations.

18 PAPERS • NO BENCHMARKS YET

HPS (Human POSEitioning System Dataset)

The HPS Dataset is a collection of 3D humans interacting with large 3D scenes (300-1000 m², up to 2500 m²). It contains images captured from a head-mounted camera, coupled with the reference 3D pose and location of the person in a pre-scanned 3D scene. Seven people are captured in 8 large scenes performing activities such as exercising, reading, eating, lecturing, using a computer, making coffee, and dancing. The dataset provides more than 300K synchronized RGB images with the corresponding reference 3D pose and location.

18 PAPERS • NO BENCHMARKS YET

LSHTC

LSHTC is a dataset for large-scale text classification. The data used in the LSHTC challenges originates from two popular sources: DBpedia and the ODP (Open Directory Project) directory, also known as DMOZ. DBpedia instances were selected from the English, non-regional Extended Abstracts provided by the DBpedia site. The DMOZ instances consist of content vectors, description vectors, or both. A content vector is obtained by directly indexing the web page using a standard indexing chain (preprocessing, stemming/lemmatization, stop-word removal).

18 PAPERS • NO BENCHMARKS YET

MED (Monotonicity Entailment Dataset)

MED is an evaluation dataset covering a wide range of monotonicity reasoning. It was constructed by collecting naturally occurring examples through crowdsourcing and well-designed examples from linguistics publications, and it consists of 5,382 examples.

18 PAPERS • 1 BENCHMARK

NNE

NNE is a dataset for nested named entity recognition in English newswire.

18 PAPERS • 1 BENCHMARK

PreCo

A large-scale English dataset for coreference resolution. The dataset is designed to embody the core challenges in coreference, such as entity representation, by alleviating the challenge of low overlap between training and test sets and enabling separated analysis of mention detection and mention clustering.

18 PAPERS • 1 BENCHMARK

QAMR (Question-Answer Meaning Representation Dataset)

Question-Answer Meaning Representation (QAMR) represents a predicate-argument structure of a sentence with a set of question-answer pairs, so that annotations can be easily provided by non-experts. QAMR is a dataset of over 5,000 sentences and 100,000 questions created by crowdsourcing workers.

18 PAPERS • NO BENCHMARKS YET

Qulac

A dataset on asking Questions for Lack of Clarity in open-domain information-seeking conversations. Qulac presents the first dataset and offline evaluation framework for studying clarifying questions in open-domain information-seeking conversational search systems.

18 PAPERS • NO BENCHMARKS YET

Taskmaster-1

Taskmaster-1 is a dialog dataset consisting of 13,215 task-based dialogs in English, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.

18 PAPERS • NO BENCHMARKS YET

UMLS

UMLS (Unified Medical Language System)

The Unified Medical Language System (UMLS) is a comprehensive resource that integrates and disseminates essential terminology, classification standards, and coding systems. Its purpose is to foster the creation of more effective and interoperable biomedical information systems and services, including electronic health records.

18 PAPERS • 1 BENCHMARK

VideoInstruct

VideoInstruct (Video Instruction Dataset)

The Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 high-quality video instruction pairs and employs a combination of human-assisted and semi-automatic annotation techniques to produce high-quality video instruction data. These methods create question-answer pairs related to…

18 PAPERS • 6 BENCHMARKS

Wiki-ZSL

The Wiki-ZSL (Wiki Zero-Shot Learning) dataset contains 113 relations and 94,383 instances from Wikipedia. The dataset is divided into three subsets: training set (98 relations), validation set (5 relations) and test set (10 relations).

18 PAPERS • 2 BENCHMARKS

CLEVR-Humans

CLEVR-Humans is a dataset of human-posed, free-form natural language questions about CLEVR images. Many of these questions contain out-of-vocabulary words and require reasoning skills that are absent from the repertoire of the model in the original work.

17 PAPERS • 1 BENCHMARK

DWIE

DWIE (Deutsche Welle corpus for Information Extraction)

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation sub-tasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document.

17 PAPERS • 4 BENCHMARKS

ETHOS

ETHOS (multi-labEl haTe speecH detectiOn dataSet)

ETHOS is a hate speech detection dataset. It is built from YouTube and Reddit comments validated through a crowdsourcing platform. It has two subsets, one for binary classification and the other for multi-label classification. The former contains 998 comments, while the latter contains fine-grained hate-speech annotations for 433 comments.

17 PAPERS • 2 BENCHMARKS

InfoSeek (Visual Information Seeking)

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset encourages models to use fine-grained knowledge that was learned during their pre-training.

17 PAPERS • 2 BENCHMARKS

KoNViD-1k

KoNViD-1k (KoNViD-1k VQA Database)

Subjective video quality assessment (VQA) strongly depends on semantics, context, and the types of visual distortions. Many existing VQA databases cover small numbers of video sequences with artificial distortions. Newly developed Quality of Experience (QoE) models and metrics are commonly evaluated against subjective data from such databases, which are the result of perception experiments. However, since the aim of these QoE models is to accurately predict the quality of natural videos, these artificially distorted video databases are an insufficient basis for learning. Additionally, their small sizes make them only marginally usable for state-of-the-art learning systems, such as deep learning. To provide a better basis for the development and evaluation of objective VQA methods, we have created a larger dataset of natural, real-world video sequences with corresponding subjective mean opinion scores (MOS) gathered through crowdsourcing. We took YFCC100m as a baseline database…

17 PAPERS • 1 BENCHMARK

Nam

Nam (A holistic approach to cross-channel image noise modeling and its application to image denoising)

A holistic approach to cross-channel image noise modeling and its application to image denoising

17 PAPERS • 1 BENCHMARK

OpenImages-v6

OpenImages V6 is a large-scale dataset consisting of 9 million training images, 41,620 validation samples, and 125,456 test samples. It is partially annotated, with 9,600 trainable classes.

17 PAPERS • 3 BENCHMARKS

PKLot (A Robust Dataset for Parking Lot Classification)

The PKLot dataset contains 12,417 images of parking lots and 695,899 images of parking spaces segmented from them, which were manually checked and labeled. All images were acquired at the parking lots of the Federal University of Parana (UFPR) and the Pontifical Catholic University of Parana (PUCPR), both located in Curitiba, Brazil.

17 PAPERS • 1 BENCHMARK

ParaBank

A large-scale English paraphrase dataset that surpasses prior work in both quantity and quality.

17 PAPERS • NO BENCHMARKS YET

PhotoChat

PhotoChat is the first dataset that sheds light on photo-sharing behavior in online messaging. PhotoChat contains 12k dialogues, each paired with a user photo that is shared during the conversation. Based on this dataset, we propose two tasks to facilitate research on image-text modeling: a photo-sharing intent prediction task, which predicts whether one intends to share a photo in the next conversation turn, and a photo retrieval task, which retrieves the most relevant photo according to the dialogue context.

17 PAPERS • 2 BENCHMARKS

Quasimodo

Quasimodo is a commonsense knowledge base that focuses on salient properties of objects. Several subsets are provided.

17 PAPERS • NO BENCHMARKS YET

Squirrel (48%/32%/20% fixed splits)

Node classification on Squirrel with the fixed 48%/32%/20% splits provided by Geom-GCN.

17 PAPERS • 2 BENCHMARKS

Squirrel (60%/20%/20% random splits)

Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

Switchboard-1 Corpus

The Switchboard-1 Telephone Speech Corpus (LDC97S62) consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-3.

17 PAPERS • 1 BENCHMARK

TempEval-3

TempEval-3 (TempEval-3: events, times, and temporal relations)

Within the SemEval-2013 evaluation exercise, the TempEval-3 shared task aims to advance research on temporal information processing. It follows on from TempEval-1 and -2, with: a three-part structure covering temporal expression, event, and temporal relation extraction; a larger dataset; and new single measures to rank systems – in each task and in general.

17 PAPERS • 2 BENCHMARKS

TopiOCQA

TopiOCQA (pronounced Tapioca) is an open-domain conversational dataset with topic switches on Wikipedia. TopiOCQA contains 3,920 conversations with information-seeking questions and free-form answers. On average, a conversation in the dataset spans 13 question-answer turns and involves four topics (documents). TopiOCQA poses a challenging test-bed for models, where efficient retrieval is required on multiple turns of the same conversation, in conjunction with constructing valid responses using conversational history.

17 PAPERS • NO BENCHMARKS YET

United Nations Parallel Corpus

The United Nations Parallel Corpus is the first parallel corpus composed from United Nations documents published by the original data creator. It consists of manually translated UN documents covering 25 years (1990 to 2014) in the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.

17 PAPERS • NO BENCHMARKS YET

Violin (VIdeO-and-Language INference)

Video-and-Language Inference is the task of joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. The Violin dataset is a dataset for this task which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels.

17 PAPERS • NO BENCHMARKS YET

Wisconsin (60%/20%/20% random splits)

Node classification on Wisconsin with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

iSarcasm

iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic. Each sarcastic tweet is further labelled with the type of ironic speech it contains.

17 PAPERS • 1 BENCHMARK

ADNI

ADNI (Alzheimer's Disease NeuroImaging Initiative)

The Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment of Alzheimer’s disease (AD).[1] This cooperative study combines expertise and funding from the private and public sectors to study subjects with AD, as well as those who may develop AD and controls with no signs of cognitive impairment.[2] Researchers at 63 sites in the US and Canada track the progression of AD in the human brain with neuroimaging, biochemical, and genetic biological markers.[2][3] This knowledge helps to find better clinical trials for the prevention and treatment of AD. ADNI has made a global impact,[4] first by developing a set of standardized protocols that allow results from multiple centers to be compared,[4] and second through its data-sharing policy, which makes all of the data available without embargo to qualified researchers worldwide.[5] To date, over 1000 scientific publications have used ADNI data.[6] A number of other…

16 PAPERS • 2 BENCHMARKS

CLEVR-Ref+

CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators.

16 PAPERS • 2 BENCHMARKS

COVID-Fact

COVID-Fact is a FEVER-like dataset of claims concerning the COVID-19 pandemic. The dataset contains claims, evidence for the claims, and contradictory claims refuted by the evidence.

16 PAPERS • NO BENCHMARKS YET

Cornell (48%/32%/20% fixed splits)

Node classification on Cornell with the fixed 48%/32%/20% splits provided by Geom-GCN.

16 PAPERS • 2 BENCHMARKS
