RVL-CDIP_MP is our first contribution toward retrieving the original documents of the IIT-CDIP test collection that were used to create RVL-CDIP. Some PDFs or encoded images were corrupt, which explains why we have around 500 fewer instances. By leveraging metadata from OCR-IDL, we matched the original identifiers from IIT-CDIP and retrieved the documents from IDL using a conversion.
It has the same 16-class label taxonomy as RVL-CDIP and contains close to 400K documents in PDF format, averaging 5 pages per document.
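
The identifier matching described above can be pictured as a lookup from RVL-CDIP document identifiers into a metadata table. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the function name `match_identifiers`, the mapping layout, and the example identifiers are all hypothetical, and the real OCR-IDL metadata schema may differ.

```python
from typing import Dict, Iterable, List, Tuple


def match_identifiers(
    rvl_ids: Iterable[str],
    metadata: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Map each source identifier to a target identifier via a metadata table.

    `metadata` is a hypothetical mapping {IIT-CDIP id -> IDL id}; the actual
    OCR-IDL metadata format is not specified here and is an assumption.
    Identifiers without a match are collected separately, mirroring the fact
    that some documents could not be recovered.
    """
    matched: Dict[str, str] = {}
    unmatched: List[str] = []
    for doc_id in rvl_ids:
        target = metadata.get(doc_id)
        if target is None:
            unmatched.append(doc_id)
        else:
            matched[doc_id] = target
    return matched, unmatched


# Made-up identifiers, for illustration only.
example_metadata = {"doc_a": "idl_001", "doc_b": "idl_002"}
matched, unmatched = match_identifiers(["doc_a", "doc_b", "doc_c"], example_metadata)
```

Keeping the unmatched identifiers in their own list makes it easy to report how many instances were lost relative to the original dataset.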