WikiTableSet (Wikipedia Table Image Dataset)

Introduced by Ly et al. in Rethinking Image-based Table Recognition Using Weakly Supervised Methods

WikiTableSet is a large publicly available image-based table recognition dataset in three languages built from Wikipedia. WikiTableSet contains nearly 4 million English table images, 590K Japanese table images, 640k French table images with corresponding HTML representation, and cell bounding boxes. We build a Wikipedia table extractor WTabHTML and use this to extract tables (in HTML code format) from the 2022-03-01 dump of Wikipedia. In this study, we select Wikipedia tables from three representative languages, i.e., English, Japanese, and French; however, the dataset could be extended to around 300 languages with 17M tables using our table extractor. Second, we normalize the HTML tables following the PubTabNet format (separating table headers and table data, removing CSS and style tags). Finally, we use Chrome and Selenium to render table images from table HTML codes. This dataset provides a standard benchmark for studying table recognition algorithms in different languages or even multilingual table recognition algorithms. You can click here for more details about this dataset.

Homepage

Benchmarks

Add a new result Link an existing benchmark

No benchmarks yet. Start a new benchmark or link an existing one.

Papers

Paper	Code	Results	Date	Stars

WikiTableSet (Wikipedia Table Image Dataset)

Benchmarks

Add a new result Link an existing benchmark

Papers

Dataset Loaders

Add Remove

Tasks

Similar Datasets

PubTabNet

PubTables-1M

Usage

License

Modalities

Languages

WikiTableSet (Wikipedia Table Image Dataset)

Benchmarks Edit Add a new result Link an existing benchmark

Papers

Dataset Loaders Edit Add Remove

Tasks Edit

Similar Datasets

PubTabNet

PubTables-1M

Usage

License Edit

Modalities Edit

Languages Edit

Benchmarks

Add a new result Link an existing benchmark

Dataset Loaders

Add Remove

Tasks

License

Modalities

Languages