PubLayNet

Introduced by Zhong et al. in PubLayNet: largest dataset ever for document layout analysis

PubLayNet is a dataset for document layout analysis, built by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. Its size is comparable to that of established computer vision datasets: it contains over 360 thousand document images in which typical document layout elements are annotated.
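
To make "annotated layout elements" concrete, the sketch below tallies annotations per element type. It assumes the annotations use a COCO-style JSON structure (`images`, `categories`, `annotations` with `bbox` as `[x, y, width, height]`); the description above does not state the file format, so treat that as an assumption. The sample record, the file name, and the helper `count_layout_elements` are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def count_layout_elements(coco):
    """Count annotations per category name in a COCO-style dict.

    Assumes `coco` has "categories" (id, name) and "annotations"
    (each with a "category_id"); this layout is an assumption.
    """
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])


# Tiny hand-made sample for illustration only; not real PubLayNet data.
sample = {
    "images": [
        {"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792},
    ],
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "table"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 2, "bbox": [50, 40, 500, 30]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50, 90, 500, 200]},
        {"id": 12, "image_id": 1, "category_id": 1, "bbox": [50, 310, 500, 150]},
        {"id": 13, "image_id": 1, "category_id": 3, "bbox": [50, 480, 500, 220]},
    ],
}

counts = count_layout_elements(sample)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

With a real annotation file, the same function could be applied to the dict returned by `json.load`, provided the file actually follows this structure.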

Source: PubLayNet: largest dataset ever for document layout analysis
