Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
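The abstract's key architectural idea is representing layout as an attention bias. A minimal sketch of that idea follows: pairwise distances between token bounding-box centers are bucketed and mapped to scalar biases added to the raw attention scores before the softmax. This is not the authors' implementation; the bucket scheme, bias table, and (x, y)-center box format are illustrative assumptions.

```python
import numpy as np

def layout_attention_bias(boxes, bias_table, num_buckets=8, max_dist=1.0):
    """boxes: (n, 2) array of normalized (x, y) token box centers.
    bias_table: (num_buckets,) scalar bias per distance bucket
    (learned in a real model; fixed here for illustration)."""
    diff = boxes[:, None, :] - boxes[None, :, :]      # (n, n, 2) offsets
    dist = np.linalg.norm(diff, axis=-1)              # pairwise distances
    buckets = np.minimum((dist / max_dist * num_buckets).astype(int),
                         num_buckets - 1)             # clip to last bucket
    return bias_table[buckets]                        # (n, n) bias matrix

def attention_with_layout(q, k, v, bias):
    """Scaled dot-product attention with an additive layout bias."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias    # bias enters pre-softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v
```

Because the bias is added to the logits rather than concatenated to the input, spatial proximity can modulate attention without enlarging the token embeddings, which is one way a decoder-based model can stay text-in, text-out while still being layout-aware.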

Results from the Paper

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering (VQA) | DocVQA test | TILT-Large | ANLS | 0.8705 | #9 |
| Visual Question Answering (VQA) | DocVQA test | TILT-Base | ANLS | 0.8392 | #14 |
| Visual Question Answering (VQA) | InfographicVQA | TILT-Large | ANLS | 61.20 | #4 |
| Document Image Classification | RVL-CDIP | TILT-Base | Accuracy | 95.25% | #11 |
| Document Image Classification | RVL-CDIP | TILT-Large | Accuracy | 95.52% | #7 |