🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task

Filter by Language

171 dataset results for Tabular

Trust Dynamics and Market Behavior in Cryptocurrency

Trust Dynamics and Market Behavior in Cryptocurrency (Trust Dynamics and Market Behavior in Cryptocurrency: A Comparative Study of Centralized and Decentralized Exchanges)

Click to add a brief description of the dataset (Markdown and LaTeX enabled).

1 PAPER • NO BENCHMARKS YET

DrivAerNet

DrivAerNet (A Parametric Car Dataset for Data-driven Aerodynamic Design and Graph-Based Drag Prediction)

DrivAerNet is a large-scale, high-fidelity CFD dataset of 3D industry-standard car shapes designed for data-driven aerodynamic design. It comprises 4000 high-quality 3D car meshes and their corresponding aerodynamic performance coefficients, alongside full 3D flow field information.

1 PAPER • NO BENCHMARKS YET

GRD-TRT-BUF-4I Technical Validation Data

This is the static test data from the study "Global Geolocated Realtime Data of Interfleet Urban Transit Bus Iding" collected by GRD-TRT-BUF-4I. test-data-a.csv was collected from December 31, 2023 00:01:30 UTC to January 1, 2024 00:01:30 UTC. test-data-b.csv was collected from January 4, 2024 01:30:30 UTC to January 5, 2024 01:30:30 UTC. test-data-c.csv was collected from January 10, 2024 16:05:30 UTC to January 11, 2024 16:05:30 UTC.

1 PAPER • NO BENCHMARKS YET

Dataset: Privacy-Preserving Gaze Data Streaming in Immersive Interactive Virtual Reality: Robustness and User Experience.

Collected data from two distinct experiments in immersive, interactive VR where participants performed dynamic tasks as their eye, head, and hand movements were recorded. In the second experiment, a range of real-time privacy mechanisms are applied to eye gaze in real-time.

1 PAPER • NO BENCHMARKS YET

SupplyGraph (SupplyGraph: A Benchmark Dataset for Supply Chain Planning using Graph Neural Networks)

Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bio-informatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graphlike in structure, making them prime candidates for applying GNN methodologies. This opens up a world of possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major setback in this approach lies in the absence of real-world benchmark datasets to facilitate the research and resolution of supply chain problem using GNNs. To address the issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of fact

1 PAPER • NO BENCHMARKS YET

Participatory Budgeting Preferences Data Set

The data set includes information about 120+ elections (configuration settings and descriptive statistics), projects and 125k+ anonymized voters and their budget preferences. Preferences were sollicited with different elicitation methods (K-approval, knapsack, K-ranking and K-token). For some elections, voters provided also preferences under a secondary elicitation method, resulting in vote pairs from the same voter on the same budgeting question but with a different elicitation method.

1 PAPER • NO BENCHMARKS YET

Uniswap

Uniswap (Replication Data for: Uniswap Daily Transaction Indices by Network)

Click to add a brief description of the dataset (Markdown and LaTeX enabled).

1 PAPER • NO BENCHMARKS YET

Concerns and Value Judgments of Stakeholders in the Non-Fungible Tokens (NFTs) Market

Concerns and Value Judgments of Stakeholders in the Non-Fungible Tokens (NFTs) Market (Replication Data for: "Centralized or Decentralized?")

Click to add a brief description of the dataset (Markdown and LaTeX enabled).

1 PAPER • NO BENCHMARKS YET

Heteroatom Doped Graphene Supercapacitor

Heteroatom doped graphene supercapacitor feature data is gathered from various literatures for use in machine learning tasks. Main motivation is to optimize supercapacitors and to gain knowledge into models for electrochemistry tasks.

1 PAPER • NO BENCHMARKS YET

Supplementary Material

Supplementary Material (Annotation Table of Review)

The file contains an annotated list of papers that are included in the literature survey.

1 PAPER • NO BENCHMARKS YET

BaitBuster-Bangla: A Comprehensive Dataset for Clickbait Detection in Bangla with Multi-Feature and Multi-Modal Analysis

The dataset contains a total of 253,070 records, with 18 features. The features are categorized into four different types: Metadata, Primary Data, Engagement Stats, and Label. Under the Metadata category contains basic information about the channel and video, such as their unique identifiers, date and time of publication, and thumbnail URLs. The Primary Data category contains information about the title and description of the video. The "Processed" columns refer to the cleaned data after denoising, deduplication and debiased for further analysis. The Engagement Stats category contains data on user engagement metrics for each video. The Label category contains predefined auto labels, human annotated labels, and AI generated pseudo labels. Auto labels are labels that are automatically derived based on a review of their titles, descriptions, and thumbnails over time. Channels with consistently misleading, exaggerated, or sensationalized content were labeled as clickbait. Those focusing on

1 PAPER • NO BENCHMARKS YET

BlendedICU, the first harmonized, international intensive care dataset

Objective This study introduces the BlendedICU dataset, a massive dataset of international intensive care data. This dataset aims to facilitate generalizability studies of machine learning models, as well as statistical studies of clinical practices in the intensive care units.

0 PAPER • NO BENCHMARKS YET

List of OWL reasoners

CSV file with a list of all examined OWL reasoners. For each item, information on usability and maintenance status, project pages, source code repositories and related documentation was gathered.

1 PAPER • NO BENCHMARKS YET

Multi-Labelled SMILES Odors dataset

This dataset is a multi-labelled SMILES odor dataset with 138 odor descriptors. This dataset was created for replicating the paper: A principal odor map unifies diverse tasks in olfactory perception.

1 PAPER • 1 BENCHMARK

Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications

The dataset is generated from the study of computational reproducibility of Jupyter notebooks from biomedical publications. Our focus lies in evaluating the extent of reproducibility of Jupyter notebooks derived from GitHub repositories linked to publications present in the biomedical literature repository, PubMed Central. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes the metadata information of the journals, publications, the Github repositories mentioned in the publications and the notebooks present in the Github repositories.

1 PAPER • NO BENCHMARKS YET

FinBench

FinBench is a benchmark for evaluating the performance of machine learning models with both tabular data inputs and profile text inputs.

1 PAPER • NO BENCHMARKS YET

WHYSHIFT

In our benchmark WHYSHIFT, we explore distribution shifts on 5 real-world tabular datasets from the economic and traffic sectors with natural spatiotemporal distribution shifts.We only pick 7 typical settings out of 22 settings and select only one representative target domain for each setting. In our benchmark, we specify the distribution shift pattern for each setting, and we provide the tools to identify risky regions with large $Y|X$ shifts and to diagnose the performance degradation.

1 PAPER • NO BENCHMARKS YET

Multicenter dataset of simulated neuroimaging features - quadratic relationship with age

A detailed description of this dataset can be found in the Zenodo repository: https://zenodo.org/record/8119042#.ZK-jJC9BxhE

1 PAPER • NO BENCHMARKS YET

Genre2Movies

Genre2Movies (Compositional queries for Movie recommendation)

Genre annotations for movies The file genre2movies.csv contains genre-movie tuples based on Wikidata annotations (https://www.wikidata.org/).

1 PAPER • NO BENCHMARKS YET

Large-scale Ridesharing DARP Instances Based on Real Travel Demand

This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.

2 PAPERS • NO BENCHMARKS YET

ATMs fault prediction

The collected dataset consists of multivariate time series (MTS) data belonging to several ATMs banking along with the annotations that the operators did when they performed a maintenance task on any of the machines.

1 PAPER • NO BENCHMARKS YET

SMDG (Standardized Multi-Channel Dataset for Glaucoma)

Standardized Multi-Channel Dataset for Glaucoma (SMDG-19) is a collection and standardization of 19 public datasets, comprised of full-fundus glaucoma images, associated image metadata like, optic disc segmentation, optic cup segmentation, blood vessel segmentation, and any provided per-instance text metadata like sex and age. This dataset is the largest public repository of fundus images with glaucoma.

0 PAPER • NO BENCHMARKS YET

Dataset of Paper Corpus

Overview of the scoping review paper corpus, sorted by their diferent intent types, categories, and subcategories. Note: Papers (77) may include multiple unique intents (172) and can therefore appear in multiple categories and subcategories.

1 PAPER • NO BENCHMARKS YET

Multicenter dataset of neuroimaging features (part I)

A detailed description of this dataset can be found in the Zenodo repository: https://zenodo.org/record/7845311#.ZK-jty9BxhE

1 PAPER • NO BENCHMARKS YET

Multicenter dataset of neuroimaging features (part II)

A detailed description of this dataset can be found in the Zenodo repository: https://zenodo.org/record/7845361#.ZK-k7y9BxhE

1 PAPER • NO BENCHMARKS YET

TAP

TAP (Traffic Accident Prediction data repository)

The Traffic Accident Prediction (TAP) data repository offers extensive coverage for 1,000 US cities (TAP-city) and 49 states (TAP-state), providing real-world road structure data that can be easily used for graph-based machine learning methods such as Graph Neural Networks. Additionally, it features multi-dimensional geospatial attributes, including angular and directional features, that are useful for analyzing transportation networks. The TAP repository has the potential to benefit the research community in various applications, including traffic crash prediction, road safety analysis, and traffic crash mitigation. The datasets can be accessed in the TAP-city and TAP-state directories.

2 PAPERS • NO BENCHMARKS YET

GIRT-Data (GitHub Issue Report Template Dataset)

GIRT-Data is the first and largest dataset of issue report templates (IRTs) in both YAML and Markdown format. This dataset and its corresponding open-source crawler tool are intended to support research in this area and to encourage more developers to use IRTs in their repositories. The stable version of the dataset contains 1_084_300 repositories, and 50_032 of them support IRTs.

2 PAPERS • NO BENCHMARKS YET

Uncertainty and Concept Drift (On the Connection between Concept Drift and Uncertainty in Industrial Artificial Intelligence)

AI-based digital twins are at the leading edge of theIndustry 4.0 revolution, which are technologically empowered bythe Internet of Things and real-time data analysis. Information collected from industrial assets is produced in a continuous fashion, yielding data streams that must be processed under stringent timing constraints. Such data streams are usually subject to non-stationary phenomena, causing that the data distribution of the streams may change, and thus the knowledge captured by models used for data analysis may become obsolete (leading to the so-called concept drift effect). The early detection of thechange (drift) is crucial for updating the model’s knowledge, which is challenging especially in scenarios where the ground truth associated to the stream data is not readily available. Among many other techniques, the estimation of the model’s confidence has been timidly suggested in a few studies as a criterion for detecting drifts in unsupervised settings. The goal of this m

1 PAPER • NO BENCHMARKS YET

WDC Block

WDC Block (WDC Block: A Blocking Benchmark)

WDC Block is a benchmark for comparing the performance of blocking methods that are used as part of entity resolution pipelines.

1 PAPER • 3 BENCHMARKS

Regensburg Pediatric Appendicitis Dataset

This dataset was acquired in a retrospective study from a cohort of pediatric patients admitted with abdominal pain to Children’s Hospital St. Hedwig in Regensburg, Germany. Multiple abdominal B-mode ultrasound images were acquired for most patients, with the number of views varying from 1 to 15. The images depict various regions of interest, such as the abdomen’s right lower quadrant, appendix, intestines, lymph nodes and reproductive organs. Alongside multiple US images for each subject, the dataset includes information encompassing laboratory tests, physical examination results, clinical scores, such as Alvarado and pediatric appendicitis scores, and expert-produced ultrasonographic findings. Lastly, the subjects were labeled w.r.t. three target variables: diagnosis (appendicitis vs. no appendicitis), management (surgical vs. conservative) and severity (complicated vs. uncomplicated or no appendicitis). The study was approved by the Ethics Committee of the University of Regensburg (

1 PAPER • NO BENCHMARKS YET

WikiTableSet

WikiTableSet (Wikipedia Table Image Dataset)

WikiTableSet is a large publicly available image-based table recognition dataset in three languages built from Wikipedia. WikiTableSet contains nearly 4 million English table images, 590K Japanese table images, 640k French table images with corresponding HTML representation, and cell bounding boxes. We build a Wikipedia table extractor WTabHTML and use this to extract tables (in HTML code format) from the 2022-03-01 dump of Wikipedia. In this study, we select Wikipedia tables from three representative languages, i.e., English, Japanese, and French; however, the dataset could be extended to around 300 languages with 17M tables using our table extractor. Second, we normalize the HTML tables following the PubTabNet format (separating table headers and table data, removing CSS and style tags). Finally, we use Chrome and Selenium to render table images from table HTML codes. This dataset provides a standard benchmark for studying table recognition algorithms in different languages or even

1 PAPER • NO BENCHMARKS YET

Poisoned Water Detection using Smartphone embedded WiFi CSI data and Machine Learning Algorithms

Poisoned Water Detection using Smartphone embedded WiFi CSI data and Machine Learning Algorithms (Dataset and machine learning algorithms to detect poisoned water from clean water via using Smartphone embedded Wi-Fi CSI data.)

This repository contains a dataset and machine learning algorithms to detect poisoned water from clean water via using equivalent Smartphone embedded Wi-Fi CSI data.

1 PAPER • NO BENCHMARKS YET

WDC Products

WDC Products is an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-word data. The three dimensions are

3 PAPERS • 3 BENCHMARKS

X-Wines (A Wine Dataset for Recommender Systems and Machine Learning)

X-Wines is a consistent wine dataset containing 100,646 instances and 21 million real evaluations carried out by users. Data were collected on the open Web in 2022 and pre-processed for wider free use. They refer to the scale 1–5 ratings carried out over a period of 10 years (2012–2021) for wines produced in 62 different countries.

0 PAPER • NO BENCHMARKS YET

Binette's 2022 Inventors Benchmark

Hand-disambiguation of a sample of U.S. patents inventor mentions from PatentsView.org.

1 PAPER • NO BENCHMARKS YET

Harmonized US National Health and Nutrition Examination Survey (NHANES) 1988-2018

The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposure of the non-institutionalized US population. Such data have considerable potential to understand how the environment and behaviors impact human health. These data are also currently leveraged to answer public health questions such as prevalence of disease. However, these data need to first be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies takes systematic cross examination and considerable efforts but is required for accurately and reproducibly characterizing the associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets and accompanied code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-20

1 PAPER • NO BENCHMARKS YET

iV2V and iV2I+ (AI4Mobile Industrial Wireless Datasets: iV2V and iV2I+)

This dataset provides wireless measurements from two industrial testbeds: iV2V (industrial Vehicle-to-Vehicle) and iV2I+ (industrial Vehicular-to-Infrastructure plus sensor).

1 PAPER • NO BENCHMARKS YET

Berlin V2X

The Berlin V2X dataset offers high-resolution GPS-located wireless measurements across diverse urban environments in the city of Berlin for both cellular and sidelink radio access technologies, acquired with up to 4 cars over 3 days. The data enables thus a variety of different ML studies towards vehicle-to-anything (V2X) communication.

1 PAPER • NO BENCHMARKS YET

BAF

BAF (Bank Account Fraud)

Bank Account Fraud (BAF) is a large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques on an anonymized, real-world bank account opening fraud detection dataset.

4 PAPERS • NO BENCHMARKS YET

WyzeRule

Wyze Rule Recommendation Dataset. It is a big dataset with 300,000 users. Please cite [1] if you used the dataset and cite [2] if you referenced the algorithm.

2 PAPERS • NO BENCHMARKS YET

ICLR Database (ICLR Database (with Textual Covariates))

A maintained database tracks ICLR submissions and reviews, augmented with author profiles and higher-level textual features.

1 PAPER • NO BENCHMARKS YET

OTTO Recommender Systems Dataset

The OTTO session dataset is a large-scale dataset intended for multi-objective recommendation research. We collected the data from anonymized behavior logs of the OTTO webshop and the app. The mission of this dataset is to serve as a benchmark for session-based recommendations and foster research in the multi-objective and session-based recommender systems area. We also launched a Kaggle competition with the goal to predict clicks, cart additions, and orders based on previous events in a user session.

1 PAPER • NO BENCHMARKS YET

RGZ EMU: Semantic Taxonomy

RGZ EMU: Semantic Taxonomy (Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy)

The data used in - "Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy" (Bowles et al. submitted) - "A New Task: Deriving Semantic Class Targets for the Physical Sciences" (Bowles et al. 2022: https://arxiv.org/abs/2210.14760) accepted at the Fifth Workshop on Machine Learning and the Physical Sciences, Neural Information Processing Systems 2022.

1 PAPER • NO BENCHMARKS YET

Vehicle Claims

The code to create the dataset is available here. The dataset used in the paper is available on github

2 PAPERS • 2 BENCHMARKS

Wikipedia Knowledge Graph dataset

Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and expanding its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis, contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range

1 PAPER • NO BENCHMARKS YET

HumSet

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as assigned classes to each entry annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of expe

2 PAPERS • NO BENCHMARKS YET

standard atomic contexts (standard contexts for the lattices of atomic lattices)

The dataset contains standard contexts of the lattices of all atomic lattices in the Concept Explorer format.

1 PAPER • NO BENCHMARKS YET

OPFLearnData

OPFLearnData (OPFLearnData: Dataset for Learning AC Optimal Power Flow)

The datasets are resulting from OPFLearn.jl, a Julia package for creating AC OPF datasets. The package was developed to provide researchers with a standardized way to efficiently create AC OPF datasets that are representative of more of the AC OPF feasible load space compared to typical dataset creation methods. The OPFLearn dataset creation method uses a relaxed AC OPF formulation to reduce the volume of the unclassified input space throughout the dataset creation process. The dataset contains load profiles and their respective optimal primal and dual solutions. Load samples are processed using AC OPF formulations from PowerModels.jl. More information on the dataset creation method can be found in our publication, "OPF-Learn: An Open-Source Framework for Creating Representative AC Optimal Power Flow Datasets" and in the package website: https://github.com/NREL/OPFLearn.jl.

1 PAPER • NO BENCHMARKS YET