🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task

Filter by Language (clear)

1483 dataset results for Texts AND English

Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. The code samples are written in over 50 programming languages (although the dominant languages are C++, C, Python, and Java) and they are annotated with a rich set of information, such as its code size, memory footprint, cpu run time, and status, which indicates acceptance or error types. The dataset is accompanied by a repository, where we provide a set of tools to aggregate codes samples based on user criteria and to transform code samples into token sequences, simplified parse trees and other code graphs. A detailed discussion of Project CodeNet is available in this paper.

11 PAPERS • NO BENCHMARKS YET

QAMPARI

QAMPARI is an ODQA benchmark, where question answers are lists of entities, spread across many paragraphs. It was created by (a) generating questions with multiple answers from Wikipedia's knowledge graph and tables, (b) automatically pairing answers with supporting evidence in Wikipedia paragraphs, and (c) manually paraphrasing questions and validating each answer.

11 PAPERS • NO BENCHMARKS YET

ReCAM (SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning)

Tasks Our shared task has three subtasks. Subtask 1 and 2 focus on evaluating machine learning models' performance with regard to two definitions of abstractness (Spreen and Schulz, 1966; Changizi, 2008), which we call imperceptibility and nonspecificity, respectively. Subtask 3 aims to provide some insights to their relationships.

11 PAPERS • 1 BENCHMARK

StylePTB

StylePTB is a fine-grained text style transfer benchmark. It consists of paired sentences undergoing 21 fine-grained stylistic changes spanning atomic lexical, syntactic, semantic, and thematic transfers of text, as well as compositions of multiple transfers which allow modelling of fine-grained stylistic changes as building blocks for more complex, high-level transfers.

11 PAPERS • NO BENCHMARKS YET

TRIPOD (TuRnIng POint Dataset)

TRIPOD contains screenplays and plot synopses with turning point (TP) annotations for 99 movies. Each movie contains:

11 PAPERS • NO BENCHMARKS YET

Talk the Walk

Talk The Walk is a large-scale dialogue dataset grounded in action and perception. The task involves two agents (a “guide” and a “tourist”) that communicate via natural language in order to achieve a common goal: having the tourist navigate to a given target location.

11 PAPERS • NO BENCHMARKS YET

TaxiNLI

TaxiNLI is a dataset collected based on the principles and categorizations of the aforementioned taxonomy. A subset of examples are curated from MultiNLI (Williams et al., 2018) by sampling uniformly based on the entailment label and the domain. The dataset is annotated with finegrained category labels.

11 PAPERS • NO BENCHMARKS YET

ViGGO

The ViGGO corpus is a set of 6,900 meaning representation to natural language utterance pairs in the video game domain. The meaning representations are of 9 different dialogue acts.

11 PAPERS • 1 BENCHMARK

VoxForge

VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac). Image Source: http://www.voxforge.org/home

11 PAPERS • 9 BENCHMARKS

BiRD (Bigram Relatedness Dataset)

Bigram Relatedness Dataset (BiRD) is a large, fine-grained, bigram relatedness dataset, using a comparative annotation technique called Best Worst Scaling. Each of BiRD's 3,345 English term pairs involves at least one bigram. BiRD is made freely available to foster further research on how meaning can be represented and how meaning can be composed.

10 PAPERS • NO BENCHMARKS YET

CIRCO (Composed Image Retrieval on Common Objects in context)

CIRCO (Composed Image Retrieval on Common Objects in context) is an open-domain benchmarking dataset for Composed Image Retrieval (CIR) based on real-world images from COCO 2017 unlabeled set. It is the first CIR dataset with multiple ground truths and aims to address the problem of false negatives in existing datasets. CIRCO comprises a total of 1020 queries, randomly divided into 220 and 800 for the validation and test set, respectively, with an average of 4.53 ground truths per query.

10 PAPERS • 1 BENCHMARK

ConditionalQA

ConditionalQA is a Question Answering (QA) dataset that contains complex questions with conditional answers, i.e. the answers are only applicable when certain conditions apply.

10 PAPERS • 1 BENCHMARK

Duolingo STAPLE Shared Task

This is the dataset for the 2020 Duolingo shared task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Sentence prompts, along with automatic translations, and high-coverage sets of translation paraphrases weighted by user response are provided in 5 language pairs. Starter code for this task can be found here: github.com/duolingo/duolingo-sharedtask-2020/. More details on the data set and task are available at: sharedtask.duolingo.com

10 PAPERS • NO BENCHMARKS YET

FM-IQA (Freestyle Multilingual Image Question Answering)

FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.

10 PAPERS • NO BENCHMARKS YET

Fig-QA

Fig-QA consists of 10256 examples of human-written creative metaphors that are paired as a Winograd schema. It can be used to evaluate the commonsense reasoning of models. The metaphors themselves can also be used as training data for other tasks, such as metaphor detection or generation.

10 PAPERS • NO BENCHMARKS YET

HINT3

HINT3 is a dataset for intent detection. It consists of 3 different datasets each containing a diverse set of intents in a single domain - mattress products retail, fitness supplements retail and online gaming named SOFMattress, Curekart and Powerplay11.

10 PAPERS • NO BENCHMARKS YET

LectureBank

LectureBank Dataset is a manually collected dataset of lecture slides. It contains 1,352 online lecture files from 60 courses covering 5 different domains, including Natural Language Processing (nlp), Machine Learning (ml), Artificial Intelligence (ai), Deep Learning (dl) and Information Retrieval (ir). In addition, it also contains the corresponding annotations for each slide.

10 PAPERS • NO BENCHMARKS YET

MINTAKA

MINTAKA is a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. It is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers.

10 PAPERS • NO BENCHMARKS YET

OCW (Only Connect Wall Dataset and creative problem solving tasks)

The OCW dataset is for evaluating creative problem solving tasks by curating the problems and human performance results from the popular British quiz show Only Connect.

10 PAPERS • 1 BENCHMARK

OVEN (Open-domain Visual Entity Recognition)

In this project, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model need to link an image onto a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets with all labels grounded onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels.

10 PAPERS • 1 BENCHMARK

Overruling

The Overruling dataset is a law dataset corresponding to the task of determining when a sentence is overruling a prior decision. This is a binary classification task, where positive examples are overruling sentences and negative examples are non-overruling sentences extracted from legal opinions. In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved. The Overruling dataset consists of 2,400 sentences.

10 PAPERS • 1 BENCHMARK

PhotoBook

A large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation.

10 PAPERS • NO BENCHMARKS YET

ReQA (Retrieval Question-Answering)

Retrieval Question-Answering (ReQA) benchmark tests a model’s ability to retrieve relevant answers efficiently from a large set of documents.

10 PAPERS • NO BENCHMARKS YET

SCICAP

SCICAP is a large-scale image captioning dataset that contains real-world scientific figures and captions. SCICAP was constructed using more than two million images from over 290,000 papers collected and released by arXiv.

10 PAPERS • 1 BENCHMARK

TED Gesture Dataset

Co-speech gestures are everywhere. People make gestures when they chat with others, give a public speech, talk on a phone, and even think aloud. Despite this ubiquity, there are not many datasets available. The main reason is that it is expensive to recruit actors/actresses and track precise body motions. There are a few datasets available (e.g., MSP AVATAR [17] and Personality Dyads Corpus [18]), but their sizes are limited to less than 3 h, and they lack diversity in speech content and speakers. The gestures also could be unnatural owing to inconvenient body tracking suits and acting in a lab environment.

10 PAPERS • 1 BENCHMARK

TekGen

The Dataset is part of the KELM corpus

10 PAPERS • 1 BENCHMARK

TheoremQA

We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance. The dataset is collected by human experts with very high quality. We provide the dataset as a new benchmark to test the limit of large language models to apply theorems to solve challenging university-level questions. We provide a pipeline in the following to prompt LLMs and evaluate their outputs with WolframAlpha.

10 PAPERS • 1 BENCHMARK

TimeDial

TimeDial presents a crowdsourced English challenge set, for temporal commonsense reasoning, formulated as a multiple choice cloze task with around 1.5k carefully curated dialogs. The dataset is derived from the DailyDialog, which is a multi-turn dialog corpus.

10 PAPERS • NO BENCHMARKS YET

X-CSQA

X-CSQA is a multilingual dataset for Commonsense reasoning research, based on CSQA.

10 PAPERS • NO BENCHMARKS YET

arXiv Summarization Dataset

This is a dataset for evaluating summarisation methods for research papers.

10 PAPERS • 3 BENCHMARKS

2012 i2b2 Temporal Relations

2012 i2b2 Temporal Relations (2012 i2b2 Temporal Relations Corpus)

The Sixth Informatics for Integrating Biology and the Bedside (i2b2) Natural Language Processing Challenge for Clinical Records focused on the temporal relations in clinical narratives. The organizers provided the research community with a corpus of discharge summaries annotated with temporal information, to be used for the development and evaluation of temporal reasoning systems. 18 teams from around the world participated in the challenge. During the workshop, participating teams presented comprehensive reviews and analysis of their systems, and outlined future research directions suggested by the challenge contributions.

9 PAPERS • 2 BENCHMARKS

28 Ghz wireless channel dataset

Our dataset which consists of multiple indoor and outdoor experiments for up to 30 m gNB-UE link. In each experiment, we fixed the location of the gNB and move the UE with an increment of roughly one degrees. The table above specifies the direction of user movement with respect to gNB-UE link, distance resolution, and the number of user locations for which we conduct channel measurements. Outdoor 30 m data also contains blockage between 3.9 m to 4.8 m. At each location, we scan the transmission beam and collect data for each beam. By doing so, we can get the full OFDM channels for different locations along the moving trajectory with all the beam angles. Moreover, we use 240 kHz subcarrier spacing, which is consistent with the 5G NR numerology at FR2, so the data we collect will be a true reflection of what a 5G UE will see.

9 PAPERS • NO BENCHMARKS YET

ART Dataset

ART Dataset (Abductive Reasoning in narrative Text)

ART consists of over 20k commonsense narrative contexts and 200k explanations.

9 PAPERS • NO BENCHMARKS YET

ASAP

ASAP (Automated Student Assessment Prize)

There are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. Each of the eight data sets has its own unique characteristics. The variability is intended to test the limits of your scoring engine's capabilities.

9 PAPERS • 1 BENCHMARK

Acronym Identification

Is an acronym disambiguation (AD) dataset for scientific domain with 62,441 samples which is significantly larger than the previous scientific AD dataset.

9 PAPERS • NO BENCHMARKS YET

Ascent KB

This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline developed at the Max Planck Institute for Informatics. The focus of this dataset is on everyday concepts such as elephant, car, laptop, etc. The current version of Ascent KB (v1.0.0) is approximately 19 times larger than ConceptNet (note that, in this comparison, non-commonsense knowledge in ConceptNet such as lexical relations is excluded).

9 PAPERS • NO BENCHMARKS YET

BabyLM

BabyLM is a dataset for small scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling. In partnership with CoNLL and CMCL, it provides a platform for approaches to pretraining with a limited-size corpus sourced from data inspired by the input to children. The task has three tracks, two of which restrict the training data to pre-released datasets of 10M and 100M words and are dedicated to explorations of approaches such as architectural variations, self-supervised objectives, or curriculum learning. The final track only restricts the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality (i.e., data from sources other than text is welcome).

9 PAPERS • NO BENCHMARKS YET

BeaverTails

BeaverTails is a dataset aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, the authors have compiled safety meta-labels for 30,207 question-answer (QA) pairs and gathered 30,144 pairs of expert comparison data for both

9 PAPERS • NO BENCHMARKS YET

BenchLMM (BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models)

Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal: 1) LMMs generally suffer performance degradation when working with other styles; 2) An LMM performs better than another model in common style does not guarantee its superior performance in other styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; 4) An intelligent LMM is expected to interpret the causes of

9 PAPERS • 1 BENCHMARK

CMU-MOSI (Multimodal Corpus of Sentiment Intensity)

The Multimodal Corpus of Sentiment Intensity (CMU-MOSI) dataset is a collection of 2199 opinion video clips. Each opinion video is annotated with sentiment in the range [-3,3]. The dataset is rigorously annotated with labels for subjectivity, sentiment intensity, per-frame and per-opinion annotated visual features, and per-milliseconds annotated audio features.

9 PAPERS • 2 BENCHMARKS

CQASUMM

CQASUMM is a dataset for CQA (Community Question Answering) summarization, constructed from the 4.4 million Yahoo! Answers L6 dataset. The dataset contains ~300k annotated samples.

9 PAPERS • NO BENCHMARKS YET

ClariQ

ClariQ is an extension of the Qulac dataset with additional new topics, questions, and answers in the training set. The test set is completely unseen and newly collected. Like Qulac, ClariQ consists of single-turn conversations (initial_request, followed by clarifying question and answer). In addition, it comes with synthetic multi-turn conversations (up to three turns). ClariQ features approximately 18K single-turn conversations, as well as 1.8 million multi-turn conversations.

9 PAPERS • NO BENCHMARKS YET

Discovery

The Discovery datasets consists of adjacent sentence pairs (s1,s2) with a discourse marker (y) that occurred at the beginning of s2. They were extracted from the depcc web corpus.

9 PAPERS • 1 BENCHMARK

Earnings Call

The Earning Calls dataset consists of processed earning conference calls data (text and audio). It can be used to predict financial risk from both textual and vocal features from conference calls.

9 PAPERS • NO BENCHMARKS YET

LDC2020T02

LDC2020T02 (Abstract Meaning Representation (AMR) Annotation Release 3.0)

Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.

9 PAPERS • 2 BENCHMARKS

MultiEURLEX

MultiEURLEX is a multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. The dataset covers 23 official EU languages from 7 language families.

9 PAPERS • NO BENCHMARKS YET

OpenASL

Large-scale American Sign Language (ASL) - English dataset collected from online video sites (e.g., YouTube). OpenASL contains 288 hours of ASL videos in multiple domains from over 200 signers.

9 PAPERS • NO BENCHMARKS YET

PATS (Pose Audio Transcript Style)

PATS dataset consists of a diverse and large amount of aligned pose, audio and transcripts. With this dataset, we hope to provide a benchmark that would help develop technologies for virtual agents which generate natural and relevant gestures.

9 PAPERS • NO BENCHMARKS YET

Datasets

1483 dataset results for Texts AND English