Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning about the text in the image.
Understanding the scene is often essential for reading text in real-world scenarios.
This paper presents the final results of the Out-Of-Vocabulary 2022 (OOV) challenge.
Nowadays, as cameras become ubiquitous in our daily routine, images of documents are increasingly abundant.
Although the topic of confidence calibration has been an active research area for the last several decades, the case of structured and sequence prediction calibration has been scarcely explored.
We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition.
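The core idea of sequence-to-sequence contrastive learning is that each element of a sequence of frame-level embeddings from one augmented view of an image is pulled toward its counterpart frame in a second view and pushed away from the other frames. The sketch below illustrates this with a frame-wise NT-Xent loss in NumPy; the function name, temperature, and toy embeddings are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def nt_xent_frames(z_a, z_b, tau=0.1):
    """Frame-wise NT-Xent contrastive loss (illustrative sketch).

    z_a, z_b: (T, d) frame embeddings from two augmented views of one
    image. Each frame in z_a treats the same-index frame in z_b as its
    positive and all other frames as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    sim = z_a @ z_b.T / tau  # (T, T) similarity matrix
    # cross-entropy with the diagonal (matching frames) as the target
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 16))
loss_same = nt_xent_frames(z, z)                           # identical views
loss_diff = nt_xent_frames(z, rng.standard_normal((5, 16)))  # unrelated views
```

As expected of a contrastive objective, the loss is near its minimum when the two views produce identical frame sequences and grows when corresponding frames disagree.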
We present CREASE: Content Aware Rectification using Angle Supervision, the first learned method for document rectification that uses the document's content, namely the location of the words and specifically their orientation, as hints to assist the rectification process.
The first attention step re-weights visual features from a CNN backbone together with contextual features computed by a BiLSTM layer.
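The re-weighting described above can be sketched as one additive-attention step: visual and contextual features are concatenated per time step, scored by a small learned projection, and the softmax-normalized scores re-weight the visual sequence. The weight shapes and names below are assumptions for illustration, not the model's actual parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(visual, context, w, v):
    """One additive-attention step (illustrative sketch).

    visual:  (T, dv) CNN backbone features per time step
    context: (T, dc) BiLSTM contextual features per time step
    w, v:    assumed learned weights, shapes (dv+dc, da) and (da,)
    Returns the attention weights and the re-weighted visual glimpse.
    """
    feats = np.concatenate([visual, context], axis=1)  # (T, dv+dc)
    scores = np.tanh(feats @ w) @ v                    # (T,) scalar scores
    alpha = softmax(scores)                            # weights sum to 1
    glimpse = alpha @ visual                           # (dv,) weighted sum
    return alpha, glimpse

rng = np.random.default_rng(0)
alpha, glimpse = attend(rng.standard_normal((7, 4)),   # T=7, dv=4
                        rng.standard_normal((7, 3)),   # dc=3
                        rng.standard_normal((7, 8)),   # da=8
                        rng.standard_normal(8))
```

The softmax guarantees a proper weighting over the time axis, so the glimpse is a convex combination of the visual feature vectors.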
This is especially true for handwritten text recognition (HTR), where each author has a unique style, unlike printed text, where the variation is smaller by design.
We propose a computational model for shape, illumination and albedo inference in a pulsed time-of-flight (TOF) camera.