DocILE is a large dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features:

i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin;

ii) Line Item Recognition, a highly practical information extraction task in which key information has to be assigned to items in a table;

iii) documents from numerous layouts, with a test set that includes zero- and few-shot cases as well as layouts commonly seen in the training set.
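To make the difference between the two tasks concrete, the sketch below shows one way the output of Key Information Localization and Extraction and of Line Item Recognition could be represented. It is an illustration only and not the DocILE library's API: the `Field` class, its attributes, the `split_fields` helper, and the field type names are all hypothetical and chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Field:
    """One localized field (hypothetical structure, not the dataset's schema)."""
    fieldtype: str                            # one of the annotated classes
    text: str                                 # extracted text
    bbox: Tuple[float, float, float, float]   # (left, top, right, bottom), relative
    line_item_id: Optional[int] = None        # None for document-level fields


def split_fields(fields: List[Field]) -> Tuple[List[Field], Dict[int, List[Field]]]:
    """Separate document-level fields from fields grouped by line item."""
    document_level: List[Field] = []
    line_items: Dict[int, List[Field]] = defaultdict(list)
    for field in fields:
        if field.line_item_id is None:
            document_level.append(field)
        else:
            line_items[field.line_item_id].append(field)
    # Return line items ordered by their identifier.
    return document_level, dict(sorted(line_items.items()))


# Illustrative predictions; field type names are placeholders.
fields = [
    Field("document_id", "INV-001", (0.70, 0.05, 0.90, 0.08)),
    Field("line_item_description", "Paper A4", (0.10, 0.40, 0.50, 0.43), line_item_id=1),
    Field("line_item_quantity", "2", (0.60, 0.40, 0.65, 0.43), line_item_id=1),
    Field("line_item_description", "Toner", (0.10, 0.45, 0.50, 0.48), line_item_id=2),
]

document_level, line_items = split_fields(fields)
```

In this representation, the first task only needs each field's type and location, while Line Item Recognition additionally requires the `line_item_id` that ties a field to a specific row of the table.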

Source: DocILE Benchmark for Document Information Localization and Extraction
