DocFormer: End-to-End Transformer for Document Understanding

We present DocFormer -- a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem that aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks that encourage multi-modal interaction. It uses text, vision, and spatial features and combines them with a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text with visual tokens and vice versa. DocFormer is evaluated on four different datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models 4x its size in number of parameters.
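To make the shared-spatial-embedding idea concrete, below is a minimal PyTorch sketch of a multi-modal self-attention layer in the spirit of the abstract: text and visual token streams are each projected and attended over independently, while a single spatial projection (derived from bounding-box embeddings) contributes the same attention bias to both modalities. All names, layer sizes, and the per-query bias form are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class MultiModalSelfAttention(nn.Module):
    """Sketch (not the paper's exact layer): text and visual streams
    attend separately, but share one spatial attention bias so tokens
    at the same document location are treated consistently."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Independent QKV projections per modality.
        self.text_qkv = nn.Linear(d_model, 3 * d_model)
        self.vis_qkv = nn.Linear(d_model, 3 * d_model)
        # One spatial projection shared across modalities (assumption:
        # a per-head, per-query additive bias; a pairwise bias is
        # equally plausible).
        self.spatial_proj = nn.Linear(d_model, n_heads)
        self.text_out = nn.Linear(d_model, d_model)
        self.vis_out = nn.Linear(d_model, d_model)

    def _attend(self, qkv, spatial_bias):
        B, T, _ = qkv.shape
        q, k, v = qkv.chunk(3, dim=-1)
        # Reshape to (B, heads, T, d_head).
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        scores = scores + spatial_bias  # shared spatial term
        attn = scores.softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, T, -1)

    def forward(self, text_feats, vis_feats, spatial_emb):
        # spatial_emb: (B, T, d_model), an embedding of 2-D box
        # coordinates. Project to (B, T, heads), then broadcast as a
        # (B, heads, T, 1) bias over all keys.
        bias = self.spatial_proj(spatial_emb).permute(0, 2, 1).unsqueeze(-1)
        text_out = self.text_out(self._attend(self.text_qkv(text_feats), bias))
        vis_out = self.vis_out(self._attend(self.vis_qkv(vis_feats), bias))
        return text_out, vis_out


if __name__ == "__main__":
    B, T, D = 2, 16, 768
    layer = MultiModalSelfAttention()
    text, vis, spatial = (torch.randn(B, T, D) for _ in range(3))
    t_out, v_out = layer(text, vis, spatial)
    print(t_out.shape, v_out.shape)  # torch.Size([2, 16, 768]) each
```

Because the spatial bias comes from one shared projection rather than per-modality ones, a text token and a visual token covering the same region receive the same positional signal, which is one plausible reading of how shared spatial embeddings help the model align the two modalities.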

ICCV 2021
Task: Document Image Classification    Dataset: RVL-CDIP

Model             Metric       Value     Global Rank
DocFormer-base    Accuracy     96.17%    # 1
DocFormer-base    Parameters   183M      # 20
DocFormer-large   Accuracy     95.50%    # 6
DocFormer-large   Parameters   536M      # 26

