Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.
144 PAPERS • 3 BENCHMARKS
OCR is inevitably linked to NLP since its final output is in text. Advances in document intelligence are driving the need for a unified technology that integrates OCR with various NLP tasks, especially semantic parsing. Since OCR and semantic parsing have been studied as separate tasks so far, the datasets for each task on their own are rich, while those for the integrated post-OCR parsing tasks are relatively insufficient. In this study, we publish a consolidated dataset for receipt parsing as the first step towards post-OCR parsing tasks. The dataset consists of thousands of Indonesian receipts, which contains images and box/text annotations for OCR, and multi-level semantic labels for parsing. The proposed dataset can be used to address various OCR and parsing tasks.
80 PAPERS • 1 BENCHMARK
We introduce FUNSD-r and CORD-r in Token Path Prediction, the revised VrD-NER datasets to reflect the real-world scenarios of NER on scanned VrDs.
3 PAPERS • 1 BENCHMARK
E-NER is a publicly available legal Named Entity Recognition (NER) data set. It contains 52 filings from the US SEC EDGAR database. The named entity tags are hand annotated.
3 PAPERS • NO BENCHMARKS YET