Introduced by Deng et al. in TURL: Table Understanding through Representation Learning

The WikiTables-TURL dataset was constructed by the authors of TURL from the WikiTable corpus, a large collection of Wikipedia tables. It consists of 580,171 tables divided into fixed training, validation and testing splits, and includes metadata for each table, such as the table name, table caption and column headers.

Of these tables, 406,706 are annotated for the Column Type Annotation (CTA) task, 55,970 for the Column Property Annotation (CPA) task and 200,744 for the Cell Entity Annotation (CEA) task. Freebase types and relations serve as the classes for CTA and CPA respectively, while Freebase entities serve as the classes for CEA. The table below lists, for each task and split, the number of annotated columns (cells in the case of CEA) and the number of classes used for annotation.

Task   Training    Validation   Testing   Classes
CTA    628,254     13,391       13,025    255
CPA    62,954      2,175        2,072     121
CEA    1,264,217   76,720       225,777   1,787,737
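
To make the split sizes easier to compare, the counts in the table above can be turned into per-task totals and split proportions. The following is a minimal sketch that only restates the numbers from the table; the names `SPLIT_COUNTS` and `split_shares` are illustrative and are not part of the dataset or of any TURL tooling.

```python
# Annotated columns (cells for CEA) per split, copied from the table above.
# The dictionary layout and key names are assumptions made for illustration.
SPLIT_COUNTS = {
    "CTA": {"training": 628_254, "validation": 13_391, "testing": 13_025},
    "CPA": {"training": 62_954, "validation": 2_175, "testing": 2_072},
    "CEA": {"training": 1_264_217, "validation": 76_720, "testing": 225_777},
}


def split_shares(counts):
    """Return each split's fraction of the task's total annotations."""
    total = sum(counts.values())
    return {split: n / total for split, n in counts.items()}


if __name__ == "__main__":
    for task, counts in SPLIT_COUNTS.items():
        total = sum(counts.values())
        shares = split_shares(counts)
        formatted = ", ".join(f"{split} {share:.1%}" for split, share in shares.items())
        print(f"{task}: {total:,} annotations ({formatted})")
```

Running the script prints one line per task, which shows how strongly the annotations for each task are concentrated in the training split.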

The authors have made the dataset and its variants publicly available for download.


