The MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the BIH Arrhythmia Laboratory between 1975 and 1979. Twenty-three recordings were chosen at random from a set of 4000 24-hour ambulatory ECG recordings collected from a mixed population of inpatients (about 60%) and outpatients (about 40%) at Boston's Beth Israel Hospital; the remaining 25 recordings were selected from the same set to include less common but clinically significant arrhythmias that would not be well-represented in a small random sample.
28 PAPERS • 5 BENCHMARKS
Higher education plays a critical role in driving an innovative economy by equipping students with knowledge and skills demanded by the workforce. While researchers and practitioners have developed data systems to track detailed occupational skills, such as those established by the U.S. Department of Labor (DOL), much less effort has been made to document skill development in higher education at a similar granularity. Here, we fill this gap by presenting a longitudinal dataset of skills inferred from over three million course syllabi taught at nearly three thousand U.S. higher education institutions. To construct this dataset, we apply natural language processing to extract from course descriptions the detailed workplace activities (DWAs) used by the DOL to describe occupations. We then aggregate these DWAs to create skill profiles for institutions and academic majors. Our dataset offers a large-scale representation of college-educated workers and their role in the economy.
0 PAPER • NO BENCHMARKS YET
The DocRED Information Extraction (DocRED-IE) dataset extends the DocRED dataset for the Document-level Closed Information Extraction (DocIE) task. DocRED-IE is a multi-task dataset and allows for 5 subtasks: (i) Document-level Relation Extraction, (ii) Mention Detection, (iii) Entity Typing, (iv) Entity Disambiguation, (v) Coreference Resolution, as well as combinations thereof such as Named Entity Recognition (NER) or Entity Linking. The DocRED-IE dataset also allows for the end-to-end tasks of: (i) DocIE and (ii) Joint Entity and Relation Extraction. DocRED-IE comprises sentence-level and document-level facts, thereby describing short as well as long-range interactions within an entire document.
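To make the subtask decomposition concrete, here is a minimal sketch of what a document-level IE example might look like. The field names, type labels, and KB identifiers are illustrative assumptions for this sketch, not the official DocRED-IE format.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    sent_id: int          # sentence index within the document (mention detection)

@dataclass
class Entity:
    mentions: list        # coreferent mentions grouped together (coreference resolution)
    type: str             # entity typing, e.g. "PER", "LOC"
    kb_id: str = ""       # entity disambiguation / linking target

@dataclass
class DocExample:
    sents: list           # the document's sentences
    entities: list
    relations: list       # (head_entity_idx, relation_label, tail_entity_idx)

# A document-level fact: the relation spans two sentences via coreference.
doc = DocExample(
    sents=["Marie Curie was born in Warsaw.", "She won two Nobel Prizes."],
    entities=[Entity([Mention("Marie Curie", 0), Mention("She", 1)], "PER", "Q7186"),
              Entity([Mention("Warsaw", 0)], "LOC", "Q270")],
    relations=[(0, "place_of_birth", 1)],
)
print(len(doc.entities), doc.relations[0][1])
```

Combining the mention, type, and KB fields recovers the composite tasks the dataset supports, e.g. NER (mentions plus types) or entity linking (mentions plus KB identifiers).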
1 PAPER • 6 BENCHMARKS
This is the replication package for our systematic literature review and can be used for the reproducibility of the individual steps of our search and selection methodology.
1 PAPER • NO BENCHMARKS YET
Dataset Generation
The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.
The dataset concerns toy tasks that a human should teach to a robot. The number of task repetitions is limited in the dataset since the human should demonstrate the task to the robot only a few times.
UruDendro is a database of wood cross-section images of commercially grown Pinus taeda trees from northern Uruguay. It comprises 64 RGB wood images together with their ring delineations and pith locations.
3 PAPERS • NO BENCHMARKS YET
MMCode is a multi-modal code generation dataset designed to evaluate the problem-solving skills of code language models in visually rich contexts (i.e. images). It contains 3,548 questions paired with 6,620 images, derived from real-world programming challenges across 10 code competition websites, with Python solutions and tests provided. The dataset emphasizes the extreme demand for reasoning abilities, the interwoven nature of textual and visual contents, and the occurrence of questions containing multiple images.
We construct a fine-grained video-text dataset with 12K annotated high-resolution videos (~400k clips). The annotation of this dataset is inspired by the video script: to make a video, one first writes a script organizing how to shoot its scenes, and to shoot a scene, one must decide the content, the shot type (medium shot, close-up, etc.), and how the camera moves (panning, tilting, etc.). We therefore extend video captioning to video scripting by annotating the videos in the format of video scripts. Unlike previous video-text datasets, we densely annotate the entire videos without discarding any scenes, and each scene has a caption of ~145 words. Beyond the vision modality, we transcribe the voice-over into text and provide it along with the video title to give more background information for annotating the videos.
The Innodata Red Teaming Prompts dataset aims to rigorously assess models’ factuality and safety. Owing to its manual creation and breadth of coverage, it facilitates a comprehensive examination of LLM performance across diverse scenarios.
1 PAPER • 1 BENCHMARK
The LLM-Seg40K dataset contains 14K images in total, divided into training, validation, and test sets of 11K, 1K, and 2K images respectively. In the training split, each image has 3.95 questions on average, and the average question length is 15.2 words. The training set contains 1,458 different categories in total.
OpenTrench3D is the first publicly available point cloud dataset of underground utilities from open trenches. It features 310 fully annotated, photogrammetrically derived 3D point clouds comprising a total of 528 million points categorized into 5 unique classes, capturing detailed scenes of open trenches that reveal underground utilities.
3 PAPERS • 1 BENCHMARK
Synthetic Question Answering dataset in Serbian, acquired by automatic translation of SQuAD.
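A translation pipeline of this kind must translate each field of the SQuAD-format entries while preserving the structure. The sketch below uses a placeholder `translate` function standing in for an actual English-to-Serbian MT system, so the example is self-contained; the entry layout follows the SQuAD v1.1 style.

```python
def translate(text):
    """Placeholder for a machine-translation system (e.g., an
    English-to-Serbian model); here it just tags the text."""
    return "[sr] " + text

def translate_squad_entry(entry):
    # Translate the context, each question, and each answer,
    # keeping the SQuAD-style nesting intact.
    return {
        "context": translate(entry["context"]),
        "qas": [{"question": translate(q["question"]),
                 "answers": [translate(a) for a in q["answers"]]}
                for q in entry["qas"]],
    }

entry = {"context": "Nikola Tesla was an inventor.",
         "qas": [{"question": "Who was Nikola Tesla?",
                  "answers": ["an inventor"]}]}
print(translate_squad_entry(entry))
```

One caveat of automatic translation for extractive QA: answer character offsets into the original context generally no longer align after translation, so answer spans have to be re-located in the translated context.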
Sentiment detection remains a pivotal task in natural language processing, yet its development in Arabic lags due to a scarcity of training materials compared to English. Addressing this gap, we present ArSen-20, a benchmark dataset tailored to propel Arabic sentiment detection forward. ArSen-20 comprises 20,000 professionally labeled tweets sourced from Twitter, focusing on the theme of COVID-19 and spanning the period from 2020 to 2023. Beyond tweet content, the dataset incorporates metadata associated with the user, enriching the contextual understanding. ArSen-20 offers a comprehensive resource to foster advancements in Arabic sentiment analysis and facilitate research in this critical domain.
LLM-generated output for compiling PDDL domains and problems (5 scenarios, 5 runs per scenario):
Specs 5 scenarios 5 trials.json – the original JSON file.
Domain.pddl and Problem.pddl – parsed out of the original JSON file for each run.
Plan.json – the plan (if plan generation succeeded).
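Extracting the per-run PDDL files from the original JSON file could be sketched as follows. The key names ("runs", "domain", "problem", "plan") are assumptions made for this illustration; the released JSON may use different keys.

```python
import json

# Hypothetical stand-in for the original JSON file; the key names
# "runs", "domain", "problem", and "plan" are illustrative assumptions.
raw = json.dumps({
    "runs": [
        {"domain": "(define (domain demo) ...)",
         "problem": "(define (problem p1) (:domain demo) ...)",
         "plan": ["(move a b)"]}
    ]
})

data = json.loads(raw)
for i, run in enumerate(data["runs"]):
    domain_pddl = run["domain"]      # would be written to Domain.pddl
    problem_pddl = run["problem"]    # would be written to Problem.pddl
    plan = run.get("plan")           # present only if plan generation succeeded
    print(i, domain_pddl.startswith("(define"), plan is not None)
```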
Provided in the linked paper.
The Drag100 dataset is introduced in the paper "GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models" as a new benchmark for drag editing.
nEMO is a simulated dataset of emotional speech in the Polish language. The corpus contains over 3 hours of samples recorded with the participation of nine actors portraying six emotional states: anger, fear, happiness, sadness, surprise, and a neutral state. The text material used was carefully selected to represent the phonetics of the Polish language. The corpus is available for free under the Creative Commons license (CC BY-NC-SA 4.0).
The landmark Cancer Genomics Program launched in 2006 has contributed immensely to awareness of the importance of cancer genomics over the past decade and has begun to change the way the disease is treated in the clinic. A large number of mutations contribute to cancer, and predicting their effects using in silico tools has become a frequently used approach; however, the use of next-generation sequencing in clinical diagnosis has also led to a considerable increase in data and a vast number of variants of uncertain significance that require further analysis and validation. These data cannot be analyzed simply with the tools and techniques traditionally available. To better understand the origin and evolution of cancer, a cancer reference framework based on modeling of genome sequencing data has therefore been proposed for the systematic identification of representative driver mutations.
We introduce a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech, and 1200h of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms, and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.
1 PAPER • 2 BENCHMARKS
This is a set of datasets containing three versions of data:
ChronoMagic, with 2,265 metamorphic time-lapse videos, each accompanied by a detailed caption.
LOGO is a multi-person long-form video dataset with frame-wise annotations on both action procedures and formations based on artistic swimming scenarios. It provides a potential for constructing an action quality assessment approach with the ability to model group information among actors. Longer video durations also challenge the ability of the method to aggregate long-term temporal information.
The $\text{BEAR}$ dataset and its larger version, $\text{BEAR}_{\text{big}}$, are benchmarks for evaluating common factual knowledge contained in language models.
3D design file repository for the Stickbug Robot, a six-armed holonomic precision pollination robot.
This dataset contains both artificial and real images of bramble flowers. The real images were taken with a RealSense D435 camera inside the West Virginia University greenhouse. All flowers are annotated in YOLO format with a bounding box and class name. The trained weights are also provided and can be used with the included Python script to detect bramble flowers. In addition, the classifier can determine whether a flower's center is visible or hidden, which is helpful for precision pollination projects. The images are augmented to make detection robust across varied environmental conditions.
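YOLO-format labels store one box per line as a class index plus center coordinates and box size, all normalized to [0, 1]. A minimal sketch of converting such a box back to pixel-coordinate corners (the image size and label values below are made up for illustration):

```python
def yolo_to_pixel(box, img_w, img_h):
    """Convert a YOLO-format box (class, x_center, y_center, width, height,
    all normalized to [0, 1]) into pixel corners (x_min, y_min, x_max, y_max)."""
    cls, xc, yc, w, h = box
    x_min = (xc - w / 2) * img_w
    y_min = (yc - h / 2) * img_h
    x_max = (xc + w / 2) * img_w
    y_max = (yc + h / 2) * img_h
    return cls, x_min, y_min, x_max, y_max

# Example: one parsed line of a YOLO label file, e.g. "0 0.5 0.5 0.2 0.1"
label = (0, 0.5, 0.5, 0.2, 0.1)
print(yolo_to_pixel(label, img_w=640, img_h=480))
```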
This dataset comprises video files (converted to TIFF format) that depict glomerular activation in mice, recorded in response to 35 monomolecular odors. Wide-field 1-photon calcium imaging was recorded at a frame rate of 100 Hz in Thy1-GCaMP6f mice implanted with cranial windows over the olfactory bulb. Mice were head-fixed during imaging, with monomolecular odors presented in a randomized sequence for 2 seconds apiece during each trial.
Demonstration video of the Stickbug Robot
Human face Deepfake dataset sampled from large datasets
RARE consists of English AMR pairs with similarity scores that reflect the structural differences between them.
5 PAPERS • 1 BENCHMARK
The $\textbf{360+x}$ dataset is a large-scale database that emphasizes a comprehensive multifaceted understanding of daily scenes. It provides diverse viewpoints and data modalities to emulate how humans obtain daily information in real-world scenarios. It includes 232 scene examples, each with an average duration of 6.2 minutes, spanning across 28 scene categories (comprising 15 indoor scenes and 13 outdoor scenes).
This dataset endeavors to fill the research void by presenting a meticulously curated collection of misogynistic memes in a code-mixed language of Hindi and English. It introduces two sub-tasks: the first entails a binary classification to determine the presence of misogyny in a meme, while the second task involves categorizing the misogynistic memes into multiple labels, including Objectification, Prejudice, and Humiliation.
SPRIGHT is the first, large-scale vision-language dataset that focuses on spatial relationships. It contains ~6M images that have been re-captioned with a synthetic focus.
This dataset accompanies the paper `Learning the mechanisms of network growth' by the same authors. It contains 6,733 networks of 20,000 nodes each, generated according to different combinations of three mechanisms: fitness, aging, and preferential attachment. The goal is to use machine learning to identify the combination of mechanisms used to create each network. The dataset includes static features from the literature and two versions of our newly developed dynamic features.
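To illustrate how the three mechanisms can combine in a growth model, here is a minimal sketch: each new node attaches to existing nodes with probability proportional to degree (preferential attachment), optionally weighted by a random fitness and penalized by age. This is an illustrative toy generator under assumed functional forms, not the authors' exact generative model.

```python
import random

def grow_network(n, m=2, fitness=False, aging=False, seed=0):
    """Grow a network node by node; attachment weight = degree,
    optionally multiplied by fitness and divided by an aging term."""
    rng = random.Random(seed)
    edges = [(0, 1)]                        # start from a single edge 0-1
    degree = {0: 1, 1: 1}
    fit = {0: rng.random(), 1: rng.random()}
    for new in range(2, n):
        weights = {}
        for v, d in degree.items():
            w = d
            if fitness:
                w *= fit[v]                 # intrinsic attractiveness
            if aging:
                w /= (new - v + 1)          # older nodes attract fewer links
            weights[v] = w
        nodes, ws = zip(*weights.items())
        targets = set()
        while len(targets) < min(m, len(degree)):
            targets.add(rng.choices(nodes, weights=ws)[0])
        for t in targets:
            edges.append((new, t))
            degree[t] += 1
        degree[new] = len(targets)
        fit[new] = rng.random()
    return edges

edges = grow_network(200, m=2, fitness=True, aging=True)
print(len(edges))  # 1 initial edge + 2 per new node = 397
```

A classifier for this task would then be trained on features extracted from such generated networks to recover which mechanisms were switched on.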
The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that aptly describe the relationship between the image and the corresponding text. These annotations provide valuable insights into the semantic connection between each pair's visual and textual elements.
5 PAPERS • 2 BENCHMARKS
MMStar is an elite vision-indispensable multi-modal benchmark comprising 1,500 meticulously selected samples. These samples are carefully balanced and purified, ensuring they exhibit visual dependency, minimal data leakage, and require advanced multi-modal capabilities. MMStar evaluates LVLMs across 6 core capabilities and 18 detailed axes.
2 PAPERS • NO BENCHMARKS YET
ImageNet-D contains 4,835 test images featuring diverse backgrounds (3,764), textures (498), and materials (573). Generated by diffusion models, ImageNet-D achieves higher image fidelity and collection efficiency than prior studies. Evaluation results show that ImageNet-D causes a significant accuracy drop for a range of vision models, from the standard ResNet visual classifier to recent foundation models such as CLIP and MiniGPT-4, reducing their accuracy by up to 60%.
Dataset overview:
vanilla.csv – interactions without specific role-play instructions.
boss.csv – interactions where ChatGPT plays the role of the user's boss.
classmate.csv – interactions where ChatGPT acts as the user's classmate.
Each turn was coded for the motives behind user responses and for the perceived naturalness of ChatGPT responses.
xMIND is an open, large-scale multilingual news dataset for multi- and cross-lingual news recommendation. xMIND is derived from the English MIND dataset using open-source neural machine translation (i.e., NLLB 3.3B).
DISL: the full dataset report is available at https://arxiv.org/abs/2403.16861
Multilingual explainable fact-checking dataset on Russia-Ukraine Conflict 2022
EgoExoLearn is a dataset designed to bridge the gap between egocentric and exocentric views of procedural activities.
FAUST-partial is a 3D registration benchmark dataset created to provide a more informative evaluation of 3D registration methods. The dataset addresses two main limitations of current 3D registration benchmarks.
5 PAPERS • 9 BENCHMARKS
IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances in 12 optical illusion categories. Each instance consists of an image with an optical illusion, a question, and 3 to 6 options, one of which is the correct answer. We refer to this task as IllusionVQA-Comprehension. The second task tests how well VLMs can differentiate geometrically impossible objects from ordinary objects when two objects are presented side by side. The task consists of 1000 instances following a similar format to the first task. We refer to this task as IllusionVQA-Soft-Localization.