TableBank Dataset | Papers With Code

Name:*

Full name (optional):

Description (Markdown and $\LaTeX$ enabled):*

To address the need for a standard open domain table benchmark dataset, the author propose a novel weak supervision approach to automatically create the TableBank, which is orders of magnitude larger than existing human labeled datasets for table analysis. Distinct from traditional weakly supervised training set, our approach can obtain not only large scale but also high quality training data.

Nowadays, there are a great number of electronic documents on the web such as Microsoft Word (.docx) and Latex (.tex) files. These online documents contain mark-up tags for tables in their source code by nature. Intuitively, one can manipulate these source code by adding bounding box using the mark-up language within each document. For Word documents, the internal Office XML code can be modified where the borderline of each table is identified. For Latex documents, the tex code can be also modified where bounding boxes of tables are recognized. In this way, high-quality labeled data is created for a variety of domains such as business documents, official fillings, research papers etc, which is tremendously beneficial for large-scale table analysis tasks.

The TableBank dataset totally consists of 417,234 high quality labeled tables as well as their original documents in a variety of domains.

Source: [TableBank](https://github.com/doc-analysis/TableBank)

Homepage URL (optional):

Paper where the dataset was introduced:

Introduction date:

Dataset license:

URL to full license terms:

Image

Currently

datasets/Screenshot_2021-01-12_at_09.48.34.png Clear

Change

---

TableBank

Benchmarks

Add a new result Link an existing benchmark

Papers

Dataset Loaders

Add Remove

Tasks

Similar Datasets

HJDataset

PRImA

STDW

PubTabNet

Usage

License

Modalities

Languages