RVL-CDIP_MP (RVL-CDIP multi-page)

Introduced by Van Landeghem et al. in Beyond Document Page Classification: Design, Datasets, and Challenges

RVL-CDIP_MP is our first contribution: it retrieves the original documents of the IIT-CDIP test collection that were used to create RVL-CDIP. Some PDFs or encoded images were corrupt, which is why it has around 500 fewer instances than RVL-CDIP. By leveraging metadata from OCR-IDL, we matched the original identifiers from IIT-CDIP and retrieved the corresponding documents from IDL using a conversion.
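
The identifier matching described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes that each RVL-CDIP image path contains the IIT-CDIP document identifier as its parent directory name, and that the OCR-IDL metadata has already been loaded into a plain dictionary mapping identifiers to PDF locations. The function names `document_id` and `match_documents`, the example paths, and the dictionary layout are all hypothetical.

```python
from pathlib import PurePosixPath


def document_id(image_path: str) -> str:
    # Assumption: the parent directory name is the IIT-CDIP document identifier.
    return PurePosixPath(image_path).parent.name


def match_documents(image_paths, metadata):
    """Split image paths into those with a known source PDF and those without.

    `metadata` is a hypothetical dict: document identifier -> PDF location.
    """
    matched, missing = {}, []
    for path in image_paths:
        doc_id = document_id(path)
        if doc_id in metadata:
            matched[path] = metadata[doc_id]
        else:
            # No usable source document, e.g. because the PDF was corrupt.
            missing.append(path)
    return matched, missing


if __name__ == "__main__":
    paths = [
        "imagesq/q/o/c/qoc54c00/80035521.tif",
        "imagesa/a/a/a/aaa00x00/1.tif",
    ]
    pdfs = {"qoc54c00": "qoc54c00.pdf"}
    found, not_found = match_documents(paths, pdfs)
    print(len(found), "matched;", len(not_found), "missing")
```

Paths that end up in the missing list would correspond to the instances that could not be recovered.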

It has the same label taxonomy as RVL-CDIP (16 classes), with close to 400K documents in PDF format, averaging 5 pages per document.

License


  • Apache 2.0
