TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK	REMOVE
Visual Question Answering (VQA)	DocVQA test	Donut	ANLS	0.675	# 28
Document Image Classification	RVL-CDIP	Donut	Accuracy	95.3%	# 10

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/donut-document-understanding-transformer/document-image-classification-on-rvl-cdip)](https://paperswithcode.com/sota/document-image-classification-on-rvl-cdip?p=donut-document-understanding-transformer)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/donut-document-understanding-transformer/visual-question-answering-on-docvqa-test)](https://paperswithcode.com/sota/visual-question-answering-on-docvqa-test?p=donut-document-understanding-transformer)`

OCR-free Document Understanding Transformer

30 Nov 2021 · Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park ·

Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains. The code, trained model and synthetic data are available at https://github.com/clovaai/donut.

PDF Abstract