RVL-CDIP_MP is our first contribution to retrieve the original documents of the IIT-CDIP test collection which were used to create RVL-CDIP. Some PDFs or encoded images were corrupt, which explains that we have around 500 fewer instances. By leveraging metadata from OCR-IDL , we matched the original identifiers from IIT-CDIP and retrieved them from IDL using a conversion.
1 PAPER • NO BENCHMARKS YET
RVL-CDIP_MP-N can serve its original goal as a covariate shift test set, now for multi-page document classification. We were able to retrieve the original full documents from DocumentCloud and Web Search.