VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach

5 Oct 2020 · Mohamed Kerroumi, Othmane Sayem, Aymen Shabou

We introduce a novel approach to scanned document representation for field extraction. It encodes the textual, visual, and layout modalities simultaneously in a 3-axis tensor used as input to a segmentation model. We improve on the recent Chargrid and WordGrid models (Katti et al., 2018) in several ways: first by taking the visual modality into account, then by increasing robustness on small datasets while keeping inference time low. Our approach is evaluated on public and private document-image datasets and shows higher performance than recent state-of-the-art methods.
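To make the grid encoding concrete, the sketch below packs one page into such a tensor: the first three channels carry the RGB scan (visual modality), and the remaining channels carry a word embedding broadcast over each word's bounding box (textual and layout modalities). This is a minimal illustration under assumptions, not the authors' implementation: the `words` input format, the `embed` callable, and the embedding size `EMB_DIM` are hypothetical placeholders.

```python
import numpy as np

EMB_DIM = 4  # hypothetical embedding size; the paper's choice may differ


def build_visualwordgrid(image, words, embed):
    """Pack one scanned page into an (H, W, 3 + EMB_DIM) tensor.

    image : H x W x 3 uint8 RGB scan (visual modality).
    words : OCR output, e.g. [{'text': 'Total', 'box': (x0, y0, x1, y1)}, ...]
            (textual + layout modalities).
    embed : callable mapping a word string to an EMB_DIM float vector.
    """
    h, w, _ = image.shape
    grid = np.zeros((h, w, 3 + EMB_DIM), dtype=np.float32)
    grid[..., :3] = image.astype(np.float32) / 255.0  # visual channels
    for word in words:
        x0, y0, x1, y1 = word["box"]
        # Broadcasting the embedding over the word's bounding box encodes
        # layout implicitly through spatial position in the grid.
        grid[y0:y1, x0:x1, 3:] = embed(word["text"])
    return grid
```

The resulting tensor can then be fed to a standard semantic-segmentation backbone (for instance a U-Net-style encoder-decoder) that predicts a field class per pixel, from which field values are recovered.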

Results from the Paper

Task: Document Layout Analysis
Dataset: RVL-CDIP
Model: VisualWordGrid

Metric   Value   Global Rank
FAR      28.7    #1
WAR      18.7    #1
