This dataset contains complex tables from the annual reports of S&P 500 companies, with detailed table structure annotations to support table structure recognition and table data extraction. The dataset, from IBM Research, consists of 89,646 pages comprising 112,887 tables with annotated cell structure.
The cell structure labels were generated through token matching between the PDF and HTML versions of each article. Compared to tables in scientific and government documents, financial tables are more diverse in style: they tend to have fewer graphical lines, larger gaps within each table, and more colour variations. These features are reflected in the dataset.
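To make the idea of token matching concrete, here is a minimal sketch of how cell text taken from an HTML table could be aligned with words extracted from the PDF to obtain one bounding box per cell. It is an illustration under simplifying assumptions (exact string equality, cells and words in the same reading order), not the pipeline actually used to build the dataset. The names `PdfWord`, `union_bbox`, and `match_cells` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class PdfWord:
    """A word extracted from the PDF page (hypothetical structure)."""
    text: str
    bbox: BBox


def union_bbox(boxes: List[BBox]) -> BBox:
    """Smallest box that encloses all the given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def match_cells(html_cells: List[str], pdf_words: List[PdfWord]) -> List[Optional[BBox]]:
    """Greedy in-order matching of HTML cell tokens to PDF words.

    Assumes exact token equality and a shared reading order.
    Returns one bounding box per cell, or None for empty or unmatched cells.
    """
    boxes: List[Optional[BBox]] = []
    pos = 0  # index of the first PDF word not yet consumed
    for cell in html_cells:
        matched: List[BBox] = []
        scan = pos
        for token in cell.split():
            # Look ahead for the next PDF word equal to this token.
            j = scan
            while j < len(pdf_words) and pdf_words[j].text != token:
                j += 1
            if j == len(pdf_words):
                matched = []  # token not found: treat the whole cell as unmatched
                break
            matched.append(pdf_words[j].bbox)
            scan = j + 1
        if matched:
            pos = scan  # consume the matched words
            boxes.append(union_bbox(matched))
        else:
            boxes.append(None)
    return boxes


# Toy example with made-up coordinates.
words = [
    PdfWord("Revenue", (10, 10, 50, 20)),
    PdfWord("1,200", (100, 10, 130, 20)),
    PdfWord("Net", (10, 30, 25, 40)),
    PdfWord("income", (28, 30, 60, 40)),
]
cells = ["Revenue", "1,200", "", "Net income"]
cell_boxes = match_cells(cells, words)
```

A real matcher would also need to handle differences between the two renderings, such as hyphenation, ligatures, and number formatting, which this sketch ignores.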
Source: IBM Developer