38 dataset results for Tables

TAT-QA (Tabular And Textual dataset for Question Answering) is a large-scale QA dataset, aiming to stimulate progress of QA research over more complex and realistic tabular and textual data, especially those requiring numerical reasoning.

49 PAPERS • 1 BENCHMARK

GitTables

GitTables is a corpus of currently 1M relational tables extracted from CSV files in GitHub covering 96 topics. Table columns in GitTables have been annotated with more than 2K different semantic types from Schema.org and DBpedia. The column annotations consist of semantic types, hierarchical relations, range types, table domain and descriptions.

13 PAPERS • NO BENCHMARKS YET

SinD (A Drone Dataset at Signalized Intersection in China)

The SinD dataset is based on 4K video captured by drones and provides traffic participant trajectories, traffic light status, and high-definition maps.

10 PAPERS • NO BENCHMARKS YET

VNAT

VNAT (VPN/NONVPN NETWORK APPLICATION TRAFFIC DATASET)

This dataset is a collection of labelled PCAP files, both encrypted and unencrypted, across 10 applications, as well as a pandas dataframe in HDF5 format containing detailed metadata summarizing the connections from those files. It was created to assist the development of machine learning tools that would allow operators to see the traffic categories of both encrypted and unencrypted traffic flows. In particular, features of the network packet traffic timing and size information (both inside of and outside of the VPN) can be leveraged to predict the application category that generated the traffic.

4 PAPERS • NO BENCHMARKS YET

eICU-CRD (eICU Collaborative Research Database)

The eICU Collaborative Research Database is a large multi-center critical care database made available by Philips Healthcare in partnership with the MIT Laboratory for Computational Physiology.

4 PAPERS • NO BENCHMARKS YET

M5Product

The M5Product dataset is a large-scale multi-modal pre-training dataset with coarse and fine-grained annotations for E-products.

3 PAPERS • NO BENCHMARKS YET

SKAB (Skoltech Anomaly Benchmark)

SKAB is designed for evaluating algorithms for anomaly detection. The benchmark currently includes 30+ datasets plus Python modules for algorithms’ evaluation. Each dataset represents a multivariate time series collected from the sensors installed on the testbed. All instances are labeled for evaluating the results of solving outlier detection and changepoint detection problems.

3 PAPERS • 2 BENCHMARKS

GIRT-Data (GitHub Issue Report Template Dataset)

GIRT-Data is the first and largest dataset of issue report templates (IRTs) in both YAML and Markdown format. This dataset and its corresponding open-source crawler tool are intended to support research in this area and to encourage more developers to use IRTs in their repositories. The stable version of the dataset contains 1,084,300 repositories, and 50,032 of them support IRTs.

2 PAPERS • NO BENCHMARKS YET

Large-scale Ridesharing DARP Instances Based on Real Travel Demand

This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.

2 PAPERS • NO BENCHMARKS YET

Multivariate-Mobility-Paris

The original dataset was provided by Orange telecom in France, which contains anonymized and aggregated human mobility data. The Multivariate-Mobility-Paris dataset comprises information from 2020-08-24 to 2020-11-04 (72 days during the COVID-19 pandemic), with time granularity of 30 minutes and spatial granularity of 6 coarse regions in Paris, France. In other words, it represents a multivariate time series dataset.

2 PAPERS • NO BENCHMARKS YET
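The granularity figures quoted for Multivariate-Mobility-Paris imply a fixed series length. As a minimal sketch, the arithmetic can be checked as below, assuming the 72 days are counted as the difference between the two dates and assuming a hypothetical layout of one row per time step and one column per region (the actual file layout is not stated in the description):

```python
from datetime import date

# Figures quoted in the dataset description.
START = date(2020, 8, 24)
END = date(2020, 11, 4)
STEP_MINUTES = 30
NUM_REGIONS = 6

# Assumption: the "72 days" is the difference between the two dates.
num_days = (END - START).days
steps_per_day = 24 * 60 // STEP_MINUTES
num_steps = num_days * steps_per_day

# Hypothetical layout: one row per 30-minute step, one column per region.
expected_shape = (num_steps, NUM_REGIONS)
print(num_days, steps_per_day, expected_shape)
```

Under these assumptions the series has 48 steps per day, which gives 3,456 time steps across the 6 regions.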

SegmentedTables

The SegmentedTables dataset is a collection of almost 2,000 tables extracted from 352 machine learning papers. Each table consists of rich text content, layout and caption. Tables are annotated with types (leaderboard, ablation, irrelevant) and cells of relevant tables are annotated with semantic roles (such as “paper model”, “competing model”, “dataset”, “metric”).

2 PAPERS • NO BENCHMARKS YET

Acoustic Extinguisher Fire Dataset

Yavuz Selim TASPINAR, Murat KOKLU and Mustafa ALTIN

1 PAPER • NO BENCHMARKS YET

ArxivPapers

The ArxivPapers dataset is an unlabelled collection of over 104K papers related to machine learning and published on arXiv.org between 2007 and 2020. The dataset includes around 94K papers (for which LaTeX source code is available) in a structured form in which each paper is split into a title, abstract, sections, paragraphs and references. Additionally, the dataset contains over 277K tables extracted from the LaTeX papers.

1 PAPER • NO BENCHMARKS YET

Can you predict product backorder?

Problem Statement

1 PAPER • NO BENCHMARKS YET

DBFC Dataset (Single Direct Borohydride Fuel Cell Dataset)

This dataset includes Direct Borohydride Fuel Cell (DBFC) impedance and polarization tests at the anode with Pd/C, Pt/C and Pd-decorated Ni–Co/rGO catalysts. The data cover different concentrations of sodium borohydride (SBH), applied voltages and anode catalyst loadings, together with an explanation of the experimental details of the electrochemical analysis. Voltage, power density and resistance of the DBFC change as a function of the SBH weight percent (%), applied voltage and anode catalyst loading; these are evaluated from polarization and impedance curves using an appropriate equivalent circuit of the fuel cell. The data allow changes in the cell's electrochemical behavior to be interpreted, which can be useful in simulation, power source investigation and in-depth analysis in DBFC research.

1 PAPER • NO BENCHMARKS YET

Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications

The dataset is generated from a study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications and the notebooks present in those repositories.

1 PAPER • NO BENCHMARKS YET

Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications version 1

Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications version 1 (Version 1)

This repository contains the dataset for the study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications and the notebooks present in those repositories.

1 PAPER • NO BENCHMARKS YET

Deep Sea Treasure Pareto-Front

The dataset contains two Pareto-fronts: one for the 2-objective problem and one for the 3-objective problem.

1 PAPER • NO BENCHMARKS YET

Industrial Benchmark Dataset for Customer Escalation Prediction

This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data including labels for escalation from a fleet of thousands of customers of high-end medical devices.

1 PAPER • NO BENCHMARKS YET

MMCode

MMCode is a multi-modal code generation dataset designed to evaluate the problem-solving skills of code language models in visually rich contexts (i.e. images). It contains 3,548 questions paired with 6,620 images, derived from real-world programming challenges across 10 code competition websites, with Python solutions and tests provided. The dataset emphasizes the extreme demand for reasoning abilities, the interwoven nature of textual and visual contents, and the occurrence of questions containing multiple images.

1 PAPER • NO BENCHMARKS YET

MineralImage5k (Benchmark for 5k raw mineral species recognition)

We present a comprehensive dataset comprising a vast collection of raw mineral samples for the purpose of mineral recognition. The dataset encompasses more than 5,000 distinct mineral species and incorporates subsets for zero-shot and few-shot learning. In addition to the samples themselves, some entries in the dataset are accompanied by supplementary natural language descriptions, size measurements, and segmentation masks. For detailed information on each sample, please refer to the minerals_full.csv file.

1 PAPER • NO BENCHMARKS YET

Nelson-Plosser

Nelson-Plosser (Nelson-Plosser US Macroeconomic Time Series)

US Macroeconomic dataset containing 14 time series of monthly observations. They have various lengths but all end in 1988. The variables: consumer price index, industrial production, nominal GNP, velocity, employment, interest rate, nominal wages, GNP deflator, money stock, real GNP, stock prices (S&P500), GNP per capita, real wages, unemployment.

1 PAPER • NO BENCHMARKS YET

Notebook Inaccessibility

This dataset artifact contains the intermediate datasets from pipeline executions necessary to reproduce the results of the paper. We share this artifact in hopes of providing a starting point for other researchers to extend the analysis on notebooks, discover more about their accessibility, and offer solutions to make data science more accessible. The scripts needed to generate and analyse these datasets are shared in the GitHub repository for this work.

1 PAPER • NO BENCHMARKS YET

PEM Fuel Cell Dataset (Proton Exchange Membrane (PEM) Fuel Cell Dataset)

This dataset covers Nafion 112 membrane standard tests and MEA activation tests of a PEM fuel cell under various operating conditions. It includes two general electrochemical analysis methods, polarization and impedance curves. The effects of different H2/O2 gas pressures, different voltages and various humidity conditions are considered in several steps. The behavior of the PEM fuel cell during distinct operating condition tests, the activation procedure, and different operating conditions before and after activation can be derived from the data. In the polarization curves, voltage and power density change as a function of H2/O2 flows and relative humidity. The resistance of the equivalent circuit used for the fuel cell can be calculated from the impedance data. The experimental response of the cell is thus evident in the presented data, which is useful for in-depth analysis, simulation and material performance investigation in PEM fuel cell research.

1 PAPER • NO BENCHMARKS YET

Poisoned Water Detection using Smartphone embedded WiFi CSI data and Machine Learning Algorithms

Poisoned Water Detection using Smartphone embedded WiFi CSI data and Machine Learning Algorithms (Dataset and machine learning algorithms to detect poisoned water from clean water via using Smartphone embedded Wi-Fi CSI data.)

This repository contains a dataset and machine learning algorithms for distinguishing poisoned water from clean water using smartphone-embedded Wi-Fi CSI data.

1 PAPER • NO BENCHMARKS YET

Replication Data for: Investigating the concentration of High Yield Investment Programs in the United Kingdom

The dataset provides information about 450 HYIPs collected between November 2020 and September 2021. This dataset was analyzed and the results are discussed in the paper.

1 PAPER • NO BENCHMARKS YET

SRSD-Feynman (Easy set)

Our SRSD (Feynman) datasets are designed to discuss the performance of Symbolic Regression for Scientific Discovery. We carefully reviewed the properties of each formula and its variables in the Feynman Symbolic Regression Database to design reasonably realistic sampling ranges of values, so that our SRSD datasets can be used for evaluating the potential of SRSD, such as whether or not an SR method can (re)discover physical laws from such datasets.

1 PAPER • NO BENCHMARKS YET

SRSD-Feynman (Hard set)

1 PAPER • NO BENCHMARKS YET

SRSD-Feynman (Medium set)

1 PAPER • NO BENCHMARKS YET

Statcan Dialogue Dataset

The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents

1 PAPER • 1 BENCHMARK

Water Footprint Recommender System Data

It contains data from two different sources: Food.com, a well-known American recipe site, and Planeat, an Italian site that lets you plan recipes to reduce food waste. The dataset is divided into two parts: embeddings, which can be used directly to execute the work and receive suggestions, and raw data, which must first be processed into embeddings.

1 PAPER • NO BENCHMARKS YET

bSDD (buildingSMART Data Dictionary)

The buildingSMART Data Dictionary (bSDD) is an online service that hosts classifications and their properties, allowed values, units and translations. The bSDD allows linking between all the content inside the database. It provides a standardized workflow to guarantee data quality and information consistency.

1 PAPER • NO BENCHMARKS YET

data_qe

data_qe (Federal Reserve Quantitative Easing Data)

This file contains the data and code for the publication "The Federal Reserve's Response to the Global Financial Crisis and Its Long-Term Impact: An Interrupted Time-Series Natural Experimental Analysis" by A. C. Kamkoum, 2023.

1 PAPER • NO BENCHMARKS YET

kaggle stroke Prediction competition

A stroke prediction competition on Kaggle whose dataset is heavily imbalanced.

1 PAPER • NO BENCHMARKS YET

washed_contract

The dataset contains about 48K contracts that are open source on Etherscan.

1 PAPER • NO BENCHMARKS YET

wildFireClimateChangeTweets

Here I provide the datasets I used for this analysis. They include the tweets I streamed using the Tweepy package in Python during the peak of the wildfire season in late summer/early fall of 2020.

1 PAPER • NO BENCHMARKS YET

Rice Dataset Commeo and Osmancik

Data Set Name: Rice Dataset (Commeo and Osmancik). Abstract: A total of 3810 rice grain images were taken for the two species (Cammeo and Osmancik) and processed, and feature inferences were made. 7 morphological features were obtained for each grain of rice.

0 PAPER • NO BENCHMARKS YET

SheetCopilot

The SheetCopilot dataset contains 28 evaluation workbooks and 221 spreadsheet manipulation tasks that are applied to these workbooks. These tasks involve diverse atomic actions related to six task categories (i.e. Entry and manipulation, Formatting, Management, Charts, Pivot Table, and Formula).

0 PAPER • 1 BENCHMARK
