TrOCR is an end-to-end Transformer-based OCR model for text recognition, built on pre-trained CV and NLP models. It leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The input text image is first resized to $384 \times 384$ and then split into a sequence of $16 \times 16$ patches, which serve as the input to the image Transformer. A standard Transformer architecture with the self-attention mechanism is used in both the encoder and the decoder, and the decoder generates wordpiece units as the recognized text for the input image.
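The resize-and-patchify step above can be sketched in a few lines. This is a minimal illustration, not TrOCR's actual implementation: it assumes the image is already resized to $384 \times 384$ and held as a NumPy array, and the function name `patchify` is our own.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Mirrors the TrOCR encoder input step: a 384x384 image with 16x16
    patches yields a sequence of (384 / 16) ** 2 = 576 patch vectors,
    each of length 16 * 16 * C.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Group pixels into a (rows, ps, cols, ps, C) grid of patches ...
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    # ... reorder so each patch's pixels are contiguous ...
    patches = patches.transpose(0, 2, 1, 3, 4)
    # ... and flatten each patch into one vector.
    return patches.reshape(-1, patch_size * patch_size * c)

image = np.zeros((384, 384, 3), dtype=np.float32)  # resized input image
seq = patchify(image)
print(seq.shape)  # (576, 768)
```

Each of the 576 patch vectors would then be linearly projected and combined with position embeddings before entering the encoder's self-attention layers.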
Source: TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
| Task | Papers | Share |
|---|---|---|
| Optical Character Recognition (OCR) | 7 | 33.33% |
| Decoder | 2 | 9.52% |
| Language Modeling | 2 | 9.52% |
| Language Modelling | 2 | 9.52% |
| Handwritten Text Recognition | 2 | 9.52% |
| Image Generation | 1 | 4.76% |
| Image Captioning | 1 | 4.76% |
| Large Language Model | 1 | 4.76% |
| Adversarial Attack | 1 | 4.76% |