DocILE is a large dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features:
i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin;
ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table;
iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set.
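
To make the difference between the two tasks concrete, the following is a minimal sketch of how extracted fields might be represented and grouped into line items. The `Field` class, its attribute names, the example field types, and the sample values are all hypothetical and chosen for illustration; they are not taken from the DocILE annotation format or its code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    """One localized piece of key information (hypothetical schema)."""
    fieldtype: str                      # one of the annotated classes
    text: str                           # extracted text value
    bbox: tuple                         # (left, top, right, bottom), relative to the page
    line_item_id: Optional[int] = None  # None for document-level fields


def group_line_items(fields):
    """Split fields into document-level fields and per-line-item groups."""
    document_fields = []
    line_items = defaultdict(list)
    for field in fields:
        if field.line_item_id is None:
            document_fields.append(field)
        else:
            line_items[field.line_item_id].append(field)
    return document_fields, dict(line_items)


# Made-up example values, not real dataset content.
fields = [
    Field("document_id", "INV-001", (0.70, 0.05, 0.90, 0.08)),
    Field("amount_total", "30.00", (0.80, 0.90, 0.95, 0.93)),
    Field("item_description", "Widget", (0.10, 0.40, 0.40, 0.43), line_item_id=1),
    Field("item_amount", "10.00", (0.80, 0.40, 0.95, 0.43), line_item_id=1),
    Field("item_description", "Gadget", (0.10, 0.45, 0.40, 0.48), line_item_id=2),
    Field("item_amount", "20.00", (0.80, 0.45, 0.95, 0.48), line_item_id=2),
]

document_fields, line_items = group_line_items(fields)
```

In this sketch, Key Information Localization and Extraction corresponds to producing the `Field` entries themselves (class, text, and location), while Line Item Recognition additionally requires the `line_item_id` assignment that ties each field to a row of the table.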
Source: DocILE Benchmark for Document Information Localization and Extraction