DTrOCR: Decoder-only Transformer for Optical Character Recognition

30 Aug 2023 · Masato Fujitake

Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image, and the decoder produces recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examined whether a generative language model that has been successful in natural language processing can also be effective for text recognition in computer vision. Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
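To make the contrast with an encoder-decoder pipeline concrete, the sketch below shows one way a decoder-only recognizer can consume an image and emit text as a single sequence. It is a minimal illustration under stated assumptions, not the paper's implementation: the function names `build_sequence` and `causal_mask`, the `<patch_i>` placeholders, and the `<sep>` separator token are all hypothetical and do not come from the abstract.

```python
# Illustrative sketch only: the layout, names, and "<sep>" token are assumptions.

def build_sequence(num_patches, text_tokens):
    """Lay out image patches and text tokens as one decoder input sequence.

    Image patch placeholders come first, then an (assumed) separator,
    then the text tokens the model is trained to generate.
    """
    patches = [f"<patch_{i}>" for i in range(num_patches)]
    return patches + ["<sep>"] + list(text_tokens)


def causal_mask(length):
    """Return a lower-triangular mask: position i may attend to j only if j <= i."""
    return [[j <= i for j in range(length)] for i in range(length)]


if __name__ == "__main__":
    seq = build_sequence(3, ["c", "a", "t"])
    mask = causal_mask(len(seq))
    for token, row in zip(seq, mask):
        visible = [t for t, ok in zip(seq, row) if ok]
        print(f"{token:>10} attends to {visible}")
```

With this layout, every text position can attend to all image patches and to earlier text, so a single causal stack can stand in for the separate encoder and decoder described above.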

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Optical Character Recognition (OCR) | Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study | DTrOCR 105M | Accuracy (%) | 89.6 | # 1 |
| Optical Character Recognition (OCR) | Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study | DTrOCR | Accuracy (%) | 89.6 | # 1 |
| Scene Text Recognition | CUTE80 | DTrOCR 105M | Accuracy | 99.1 | # 6 |
| Handwritten Text Recognition | IAM | DTrOCR 105M | CER | 2.38 | # 1 |
| Scene Text Recognition | ICDAR2013 | DTrOCR 105M | Accuracy | 99.4 | # 2 |
| Scene Text Recognition | ICDAR2015 | DTrOCR 105M | Accuracy | 93.5 | # 1 |
| Scene Text Recognition | IIIT5k | DTrOCR 105M | Accuracy | 99.6 | # 1 |
| Task 2 | SROIE | DTrOCR 105M | F1 | 98.37 | # 1 |
| Scene Text Recognition | SVT | DTrOCR 105M | Accuracy | 98.9 | # 2 |
| Scene Text Recognition | SVTP | DTrOCR 105M | Accuracy | 98.6 | # 1 |
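For readers unfamiliar with the metrics reported here, the following sketch shows how character error rate (CER) and word-level accuracy are commonly computed. It is a generic illustration, not the evaluation code behind these results: the function names are hypothetical, reporting both values as percentages is an assumption, and individual benchmarks may apply their own normalization (for example case folding or punctuation filtering) that is not reproduced here.

```python
# Generic metric sketch; names and percentage scaling are assumptions.

def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate in percent: total edits over total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_chars


def word_accuracy(refs, hyps):
    """Percentage of predictions that match their reference exactly."""
    correct = sum(r == h for r, h in zip(refs, hyps))
    return 100.0 * correct / len(refs)


if __name__ == "__main__":
    refs = ["hello", "world"]
    hyps = ["hallo", "world"]
    print(cer(refs, hyps))            # one substitution over ten characters
    print(word_accuracy(refs, hyps))  # one of two predictions is exact
```

Lower is better for CER, while higher is better for accuracy and F1, which is worth keeping in mind when comparing rows that report different metrics.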

Methods