The ToughTables (2T) dataset was created for the SemTab challenge and includes 180 tables in total. The tables fall into two groups: the control (CTRL) group and the tough (TOUGH) group.

The CTRL group contains 60 tables, either generated by querying the DBpedia SPARQL endpoint or collected from Wikipedia; these tables are easy to annotate. The TOUGH group contains 120 tables, mainly scraped from the web, some of which contain misspelled words, nicknames, and homonyms; these tables are hard to annotate. In both groups, some tables were generated by the authors by adding noise to the collected tables.

The dataset was annotated for two tasks, Column Type Annotation (CTA) and Cell Entity Annotation (CEA), using types and entities from DBpedia (DBP) and Wikidata (WD). The table below lists, for each task and knowledge graph, the number of annotations (columns for CTA, cells for CEA) and the number of classes used.

Task                             Annotations    Classes
DBP-Column Type Annotation               540         39
DBP-Cell Entity Annotation           663,656     16,023
WD-Column Type Annotation                540        276
WD-Cell Entity Annotation            667,244     24,653
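
To make the two columns of the table concrete, the following is a minimal Python sketch of how CTA and CEA annotations could be represented and counted. The record layouts (`table_id`, `row`, `col`, and so on), the `summarize` helper, and the toy records are all hypothetical and are not taken from the dataset's actual files. The sketch also assumes that "Classes" counts the distinct labels used (types for CTA, entities for CEA), which the description above does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnTypeAnnotation:
    """Hypothetical CTA record: one table column linked to a type."""
    table_id: str
    col: int
    type_uri: str


@dataclass(frozen=True)
class CellEntityAnnotation:
    """Hypothetical CEA record: one table cell linked to an entity."""
    table_id: str
    row: int
    col: int
    entity_uri: str


def summarize(annotations, label_of):
    """Return (number of annotations, number of distinct labels used)."""
    labels = [label_of(a) for a in annotations]
    return len(labels), len(set(labels))


# Toy records for illustration only; they are not part of the 2T dataset.
toy_cta = [
    ColumnTypeAnnotation("t1", 0, "http://dbpedia.org/ontology/City"),
    ColumnTypeAnnotation("t2", 0, "http://dbpedia.org/ontology/City"),
]
toy_cea = [
    CellEntityAnnotation("t1", 1, 0, "http://dbpedia.org/resource/Rome"),
    CellEntityAnnotation("t1", 2, 0, "http://dbpedia.org/resource/Paris"),
    CellEntityAnnotation("t2", 1, 0, "http://dbpedia.org/resource/Rome"),
]

cta_summary = summarize(toy_cta, lambda a: a.type_uri)
cea_summary = summarize(toy_cea, lambda a: a.entity_uri)
print(cta_summary)  # (2, 1)
print(cea_summary)  # (3, 2)
```

Under these assumptions, each row of the table corresponds to one call to `summarize` over the full set of annotations for that task and knowledge graph.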
