Context of the datasets: The Zooniverse platform (www.zooniverse.org) has successfully built a large community of volunteers contributing to citizen science projects; both Galaxy Zoo and the Milky Way Project were hosted there.
WDC Block is a benchmark for comparing the performance of blocking methods that are used as part of entity resolution pipelines.
1 PAPER • 3 BENCHMARKS
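To make the WDC Block description above concrete, here is a minimal sketch of token blocking, one standard blocking scheme (not WDC Block's own tooling); the toy records and their fields are invented for illustration.

```python
from collections import defaultdict

# Invented toy records; not the WDC Block schema.
records = {
    1: "Apple iPhone 12 64GB",
    2: "iPhone 12 Apple 64GB",
    3: "Samsung Galaxy S21",
}

# Token blocking: every lowercased title token becomes a blocking key,
# and records sharing a key land in the same block.
blocks = defaultdict(set)
for rid, title in records.items():
    for tok in title.lower().split():
        blocks[tok].add(rid)

# Candidate pairs are drawn only from within blocks, shrinking the
# quadratic all-pairs comparison space of the expensive matching step.
candidates = set()
for ids in blocks.values():
    ids = sorted(ids)
    for i, a in enumerate(ids):
        candidates.update((a, b) for b in ids[i + 1:])

print(sorted(candidates))  # [(1, 2)]; record 3 shares no token, so it is never compared
```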
In our benchmark WHYSHIFT, we explore distribution shifts on 5 real-world tabular datasets from the economic and traffic sectors with natural spatiotemporal distribution shifts. We select 7 typical settings out of 22 and choose one representative target domain for each. For each setting we specify the distribution-shift pattern, and we provide tools to identify risky regions with large $Y|X$ shifts and to diagnose the resulting performance degradation.
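As a hedged illustration of the kind of $Y|X$-shift diagnosis described above (a sketch on synthetic data, not the benchmark's actual tooling): fit a model on the source domain and compare its per-region error on source versus target; regions where the gap is large are flagged as risky.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data; X[:, 0] defines "regions" (e.g., spatial bins).
# In the target domain the label rule flips for region >= 7: a Y|X shift.
def make_domain(shifted):
    X = rng.uniform(0, 10, size=(4000, 3))
    y = (X[:, 1] > 5).astype(int)
    if shifted:
        flip = X[:, 0] >= 7
        y[flip] = 1 - y[flip]
    return X, y

Xs, ys = make_domain(shifted=False)   # source
Xt, yt = make_domain(shifted=True)    # target

model = GradientBoostingClassifier().fit(Xs, ys)

# Compare per-region error on source vs. target; a large gap signals
# that Y|X changed there, not merely the input distribution.
bins = np.arange(0, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    ms = (Xs[:, 0] >= lo) & (Xs[:, 0] < hi)
    mt = (Xt[:, 0] >= lo) & (Xt[:, 0] < hi)
    err_s = 1 - model.score(Xs[ms], ys[ms])
    err_t = 1 - model.score(Xt[mt], yt[mt])
    flag = "RISKY" if err_t - err_s > 0.2 else ""
    print(f"region [{lo},{hi}): source err {err_s:.2f}, target err {err_t:.2f} {flag}")
```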
A fact-based text editing dataset built from the WebNLG dataset.
1 PAPER • 1 BENCHMARK
WikiTableSet is a large, publicly available image-based table recognition dataset in three languages built from Wikipedia. It contains nearly 4 million English table images, 590K Japanese table images, and 640K French table images with corresponding HTML representations and cell bounding boxes. We built a Wikipedia table extractor, WTabHTML, and used it to extract tables (as HTML) from the 2022-03-01 Wikipedia dump. In this study we select Wikipedia tables from three representative languages, i.e., English, Japanese, and French; however, the dataset could be extended to around 300 languages with 17M tables using our table extractor. We then normalize the HTML tables following the PubTabNet format (separating table headers and table data, removing CSS and style tags). Finally, we use Chrome and Selenium to render table images from the table HTML. This dataset provides a standard benchmark for studying table recognition algorithms in different languages.
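A minimal sketch of the Chrome + Selenium rendering step described above, assuming Chrome and the Selenium Python bindings are installed; the HTML snippet and file names are illustrative, not the authors' actual script.

```python
from pathlib import Path
from selenium import webdriver
from selenium.webdriver.common.by import By

# Illustrative table HTML; not taken from WikiTableSet.
html = """<html><body>
<table border="1">
  <thead><tr><th>Language</th><th>Tables</th></tr></thead>
  <tbody><tr><td>English</td><td>4M</td></tr></tbody>
</table>
</body></html>"""

page = Path("table.html")
page.write_text(html, encoding="utf-8")

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # render without a visible browser
driver = webdriver.Chrome(options=options)
try:
    driver.get(page.resolve().as_uri())
    # Screenshot just the <table> element to produce the table image.
    driver.find_element(By.TAG_NAME, "table").screenshot("table.png")
finally:
    driver.quit()
```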
Wikipedia is the largest and most widely read free online encyclopedia. As such, it offers a large amount of data on all of its content and the interactions around it, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the sheer amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting Wikipedia data and to expand its analytical potential, we collected data from various sources, processed it, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, here limited to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful for a wide range of analyses.
[Real or Fake]: Fake Job Description Prediction. This dataset contains 18K job descriptions, of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The dataset can be used to build classification models that learn to identify fraudulent job descriptions.
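For illustration, a minimal baseline classifier sketch; the file name and the "description"/"fraudulent" columns are assumptions based on the common Kaggle release of this dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Column names are assumptions; adjust to the actual file layout.
df = pd.read_csv("fake_job_postings.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["description"].fillna(""), df["fraudulent"],
    test_size=0.2, stratify=df["fraudulent"], random_state=0)

vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

# class_weight="balanced" matters here: only ~800 of 18K postings are fake.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(Xtr, y_train)
print(classification_report(y_test, clf.predict(Xte)))
```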
Can you detect fraud from customer transactions? Imagine standing at the check-out counter at the grocery store with a long line behind you and the cashier not-so-quietly announces that your card has been declined. In this moment, you probably aren’t thinking about the data science that determined your fate.
0 PAPERS • NO BENCHMARKS YET
Data Set Name: Rice Dataset (Cammeo and Osmancik). Abstract: A total of 3,810 images of rice grains were taken for the two species (Cammeo and Osmancik) and processed, and feature inferences were made. Seven morphological features were obtained for each grain of rice.
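A minimal sketch of training a classifier on the seven morphological features; the CSV file name and the "Class" label column are assumptions about how the data is laid out.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# File and column names are assumptions; the release ships the seven
# morphological features plus a class label (Cammeo / Osmancik).
df = pd.read_csv("rice_cammeo_osmancik.csv")
X = df.drop(columns=["Class"])
y = df["Class"]

# Scale the features (areas and lengths live on very different ranges),
# then fit a linear classifier on the seven morphological features.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```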
We present an analysis of real data from a contact tracing (CT) experiment that was conducted in Italy over 8 months and involved more than 100,000 CT app users.
Standardized Multi-Channel Dataset for Glaucoma (SMDG-19) is a collection and standardization of 19 public datasets, comprising full-fundus glaucoma images, associated image metadata such as optic disc, optic cup, and blood vessel segmentations, and any provided per-instance text metadata such as sex and age. This dataset is the largest public repository of fundus images with glaucoma.
The Reddit COVID Dataset contains 4.51M Reddit posts and 17.8M comments, covering all mentions of COVID up to 2021-10-25 across the entire Reddit social network. Both were procured with SocialGrep's export feature and released as part of the SocialGrep Reddit datasets. The posts are labeled with their subreddit, title, creation date, domain, selftext, and score. The comments are labeled with their subreddit, body, creation date, sentiment (precomputed using a VADER pipeline), and score.
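The comment sentiment above was precomputed with a VADER pipeline; the sketch below shows how such a score is obtained with the vaderSentiment package (the example comment is invented, and the dataset's exact pipeline may differ).

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# An invented example comment; the dataset stores one such score per comment.
comment = "Finally got my second dose, feeling relieved!"
scores = analyzer.polarity_scores(comment)

# 'compound' is VADER's normalized summary score in [-1, 1];
# 'pos'/'neu'/'neg' are the proportions of the text in each class.
print(scores)
```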
X-Wines is a consistent wine dataset containing 100,646 wine instances and 21 million real ratings given by users. The data were collected from the open Web in 2022 and pre-processed for broad free use. The ratings are on a 1–5 scale and were given over a ten-year period (2012–2021) for wines produced in 62 different countries.
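As an illustrative use of the ratings (the file and column names are assumptions, not the official X-Wines schema), a short sketch ranking wines by a shrunken mean rating so that wines with few ratings don't dominate:

```python
import pandas as pd

# File and column names are assumptions about the data layout.
ratings = pd.read_csv("xwines_ratings.csv")   # UserID, WineID, Rating (1-5)
wines = pd.read_csv("xwines_wines.csv")       # WineID, WineName, Country

# Shrink each wine's mean rating toward the global mean; k is an
# arbitrary pseudo-count controlling the smoothing strength.
global_mean = ratings["Rating"].mean()
k = 50
stats = ratings.groupby("WineID")["Rating"].agg(["mean", "count"])
stats["score"] = (stats["count"] * stats["mean"] + k * global_mean) / (stats["count"] + k)

top = stats.sort_values("score", ascending=False).head(10)
print(top.join(wines.set_index("WineID")[["WineName", "Country"]]))
```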