DocILE Dataset | Papers With Code


**DocILE** is a large dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features:

i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin;

ii) Line Item Recognition, a highly practical information extraction task in which key information has to be assigned to items in a table;

iii) documents from numerous layouts, with a test set that includes zero- and few-shot cases as well as layouts commonly seen in the training set.

Source: [DocILE Benchmark for Document Information Localization and Extraction](https://arxiv.org/pdf/2302.05658v1.pdf)

Image Source: [https://arxiv.org/pdf/2302.05658v1.pdf](https://arxiv.org/pdf/2302.05658v1.pdf)
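
To make the difference between the two tasks concrete, the following is a minimal Python sketch of how extracted fields might be split into document-level key information and per-line-item groups. The `Field` class, its attribute names (`fieldtype`, `text`, `line_item_id`), and the sample values are illustrative assumptions, not the dataset's actual annotation schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    """One extracted field (hypothetical structure, for illustration only)."""
    fieldtype: str                      # class label, e.g. "line_item_quantity"
    text: str                           # extracted text value
    line_item_id: Optional[int] = None  # None for document-level fields


def group_fields(fields):
    """Separate document-level fields from fields assigned to line items."""
    document_level = []
    line_items = defaultdict(list)
    for field in fields:
        if field.line_item_id is None:
            # Key Information Localization and Extraction: no table row attached
            document_level.append(field)
        else:
            # Line Item Recognition: the field belongs to one item in a table
            line_items[field.line_item_id].append(field)
    return document_level, dict(line_items)


# Made-up sample values, not taken from the dataset
fields = [
    Field("document_id", "INV-001"),
    Field("line_item_description", "Widget", line_item_id=1),
    Field("line_item_quantity", "2", line_item_id=1),
    Field("line_item_description", "Gadget", line_item_id=2),
]
doc_level, items = group_fields(fields)
```

Here `doc_level` holds the single document-level field, while `items` maps each line item identifier to the fields assigned to that table row.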


---


Similar Datasets

SROIE

DUDE

NAF

CORD
