RVL-CDIP_MP (RVL-CDIP multi-page)

Introduced by Van Landeghem et al. in Beyond Document Page Classification: Design, Datasets, and Challenges

RVL-CDIP_MP is our first contribution toward retrieving the original documents of the IIT-CDIP test collection that were used to create RVL-CDIP. Some PDFs or encoded images were corrupt, which is why we have around 500 fewer instances than RVL-CDIP. By leveraging metadata from OCR-IDL, we matched the original identifiers from IIT-CDIP and retrieved the corresponding documents from IDL using a conversion.
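As an illustration only, the identifier matching described above can be pictured as a lookup from IIT-CDIP identifiers to IDL identifiers built from metadata records. The sketch below is not the authors' pipeline: the function name `match_identifiers` and the record keys `iit_cdip_id` and `idl_id` are hypothetical, and the real OCR-IDL metadata layout may differ.

```python
def match_identifiers(rvl_ids, metadata):
    """Map original IIT-CDIP identifiers to IDL identifiers via metadata.

    rvl_ids:  iterable of IIT-CDIP identifiers (hypothetical format)
    metadata: iterable of dicts with the hypothetical keys
              "iit_cdip_id" and "idl_id"
    Returns (matched, unmatched): a dict of found mappings and a list
    of identifiers that have no metadata entry.
    """
    # Build the lookup table once so each identifier is resolved in O(1).
    lookup = {row["iit_cdip_id"]: row["idl_id"] for row in metadata}

    matched, unmatched = {}, []
    for doc_id in rvl_ids:
        if doc_id in lookup:
            matched[doc_id] = lookup[doc_id]
        else:
            # Kept separately so missing documents can be counted.
            unmatched.append(doc_id)
    return matched, unmatched
```

Identifiers that end up in `unmatched`, together with files that are found but turn out to be corrupt, would account for a gap in instance count such as the one mentioned above.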

It has the same 16-class label taxonomy as RVL-CDIP and contains close to 400K documents in PDF format, averaging 5 pages per document.
