In recent years, research on visual document understanding (VDU) has grown significantly, with a particular emphasis on the development of self-supervised learning methods.
Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs.
Ranked #10 on Document Image Classification on RVL-CDIP
Compared to previous works, our method shows better or comparable performance on dense prediction fine-tuning tasks.
On the other hand, this paper tackles the problem by going back to the basics: an effective combination of text and layout.
Ranked #3 on Relation Extraction on FUNSD
Although recent advances in OCR enable the accurate extraction of text segments, it is still challenging to extract key information from documents due to the diversity of layouts.