PubLayNet is a dataset for document layout analysis, created by automatically matching the XML representations and the content of over 1 million PDF articles publicly available on PubMed Central. The dataset is comparable in size to established computer vision datasets, containing over 360 thousand document images in which typical document layout elements are annotated.
121 PAPERS • 1 BENCHMARK
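PubLayNet's annotations are released in the COCO object-detection format, so a quick way to inspect the dataset is to tally annotations per layout category. The sketch below assumes the standard COCO layout (a top-level JSON object with "categories" and "annotations" keys); the function name and file path are illustrative, not part of any official tooling.

```python
import json
from collections import Counter

def count_layout_categories(coco_json_path):
    """Count annotations per layout category in a COCO-style file.

    Assumes the COCO annotation schema used by PubLayNet: "categories"
    maps ids to names (e.g. "text", "title", "list", "table", "figure"),
    and each entry in "annotations" carries a "category_id".
    """
    with open(coco_json_path) as f:
        data = json.load(f)
    id_to_name = {c["id"]: c["name"] for c in data["categories"]}
    counts = Counter(id_to_name[a["category_id"]] for a in data["annotations"])
    return dict(counts)
```

The same helper works unchanged for any other COCO-format layout dataset.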
The RVL-CDIP dataset consists of scanned document images belonging to 16 classes, such as letter, form, email, resume, and memo. The dataset has 320,000 training, 40,000 validation, and 40,000 test images. The images are characterized by low quality, noise, and low resolution, typically 100 dpi.
107 PAPERS • 3 BENCHMARKS
A benchmark dataset containing 500K document pages with fine-grained token-level annotations for document layout analysis. DocBank is constructed in a simple yet effective way, using weak supervision from LaTeX documents available on arXiv.org.
35 PAPERS • NO BENCHMARKS YET
The database consists of 150 annotated pages from three different medieval manuscripts with challenging layouts. Furthermore, we provide layout analysis ground truth that has been iterated on, reviewed, and refined by an expert in medieval studies.
15 PAPERS • 2 BENCHMARKS
The DSSE-200 is a complex document layout dataset covering a variety of document styles. It contains 200 images drawn from pictures, PPT slides, brochures, old newspapers, and scanned documents.
8 PAPERS • NO BENCHMARKS YET
HJDataset is a large dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements. The dataset is constructed using a combination of human and machine efforts.
5 PAPERS • NO BENCHMARKS YET
We present the VIS30K dataset, a collection of 29,689 images that represents 30 years of figures and tables from each track of the IEEE Visualization conference series (Vis, SciVis, InfoVis, VAST). VIS30K’s comprehensive coverage of the scientific literature in visualization not only reflects the progress of the field but also enables researchers to study the evolution of the state-of-the-art and to find relevant work based on graphical content. We describe the dataset and our semi-automatic collection process, which couples convolutional neural networks (CNN) with curation. Extracting figures and tables semi-automatically allows us to verify that no images are overlooked or extracted erroneously. To improve quality further, we engaged in a peer-search process for high-quality figures from early IEEE Visualization papers.
The D4LA dataset is a diverse benchmark for document layout analysis (DLA) derived from the RVL-CDIP dataset. It focuses on 12 document types with rich layouts, each represented by approximately 1,000 manually annotated images, while filtering out noisy, handwritten, artistic, or text-scarce images. The dataset defines 27 detailed layout categories, including DocTitle, ListText, Header, Table, Equation, and Footer, among others, catering to real-world applications.
3 PAPERS • 1 BENCHMARK
U-DIADS-Bib is a proprietary dataset developed through a collaboration between computer scientists and humanities scholars at the University of Udine. It comprises 200 images, 50 from each of the 4 manuscripts that make it up. These handwritten books were selected in collaboration with humanist partners for both the complexity of their layout and the presence of significant, semantically distinguishable elements. In particular, the images of the four manuscripts were collected from the digital library Gallica. All manuscripts are Latin and Syriac Bibles produced between the 6th and 12th centuries A.D.
2 PAPERS • 1 BENCHMARK
Revision: v1.0.0-full-20210527a
DOI: 10.5281/zenodo.4817662
Authors: J. Chazalon, E. Carlinet, Y. Chen, J. Perret, C. Mallet, B. Duménieu and T. Géraud
Official competition website: https://icdar21-mapseg.github.io/
1 PAPER • NO BENCHMARKS YET
We compiled a new dataset (the PERO layout dataset) that contains 683 images from various sources and historical periods, with complete manual annotations of text blocks, text line polygons, and baselines. The included documents range from handwritten letters to historic printed books and newspapers, and span various languages including Arabic and Russian. Part of the PERO dataset was collected from existing datasets (cBAD, IMPACT, and BADAM) and extended with additional layout annotations. The dataset is split into 456 training and 227 testing images.
The UrduDoc Dataset is a benchmark dataset for Urdu text line detection in scanned documents, created as a byproduct of the UTRSet-Real dataset generation process. Comprising 478 diverse images collected from sources such as books, documents, manuscripts, and newspapers, it offers a valuable resource for research in Urdu document analysis. It includes 358 pages for training and 120 pages for validation, featuring a wide range of styles, scales, and lighting conditions. It serves as a benchmark for evaluating printed Urdu text detection models, and benchmark results of state-of-the-art models are provided, with the Contour-Net model achieving the best h-mean.
1 PAPER • 1 BENCHMARK
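The h-mean used to rank detectors on UrduDoc is the harmonic mean of precision and recall, i.e. the familiar F1 score. A minimal sketch (the function name is illustrative, not from the UrduDoc tooling):

```python
def h_mean(precision, recall):
    """Harmonic mean of detection precision and recall (the F1 score).

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a detector with precision 0.8 and recall 0.6 scores an h-mean of about 0.686, lower than either input, since the harmonic mean penalizes imbalance between the two.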