PubTables-1M (PubMed Tables One Million)

Introduced by Smock et al. in PubTables-1M: Towards comprehensive table extraction from unstructured documents

The goal of PubTables-1M is to create a large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis. It contains:

460,589 annotated document pages containing tables for table detection.
947,642 fully annotated tables including text content and complete location (bounding box) information for table structure recognition and functional analysis.
Full bounding boxes in both image and PDF coordinates for all table rows, columns, and cells (including blank cells), as well as other annotated structures such as column headers and projected row headers.
Rendered images of all tables and pages.
Bounding boxes and text for all words appearing in each table and page image.
Additional cell properties not used in the current model training.

Additionally, cells in the headers are canonicalized and we implement multiple quality control steps to ensure the annotations are as free of noise as possible. For more details, please see our paper.

Homepage

Benchmarks

Add a new result Link an existing benchmark

No benchmarks yet. Start a new benchmark or link an existing one.

Papers

Paper	Code	Results	Date	Stars

Dataset Loaders

Add Remove

microsoft/table-transformer

1,800

phamquiluan/table-transformer

Tasks

Similar Datasets

TabLeX

WTW

TNCR Dataset

PubTabNet

Usage

PubTables-1M (PubMed Tables One Million)

Benchmarks Edit Add a new result Link an existing benchmark

Papers

Dataset Loaders Edit Add Remove

Tasks Edit

Similar Datasets

TabLeX

WTW

TNCR Dataset

PubTabNet

Usage

License Edit

Modalities Edit

Languages Edit

Benchmarks

Add a new result Link an existing benchmark

Dataset Loaders

Add Remove

Tasks

License

Modalities

Languages