Nowadays, as cameras are rapidly being adopted into our daily routines, images of documents are becoming increasingly abundant.
Although confidence calibration has been an active research area for the last several decades, calibration for structured and sequence prediction has scarcely been explored.
We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition.
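At the core of contrastive learning of visual representations is a loss that pulls two augmented views of the same input together while pushing apart views of other inputs. As a minimal illustrative sketch (not the paper's actual implementation), the snippet below computes an NT-Xent-style loss over paired element embeddings; the function name `nt_xent_pairs` and the simple list-of-lists representation are assumptions made for this example. In a sequence-to-sequence setting, each row would correspond to one sequence element rather than a whole image.

```python
import math

def nt_xent_pairs(z_a, z_b, temperature=0.5):
    """Simplified contrastive (NT-Xent-style) loss over paired embeddings.

    z_a, z_b: lists of embedding vectors; z_a[i] and z_b[i] are two
    views of the same element (a positive pair), while all other rows
    of z_b act as negatives. This is an illustrative sketch, not the
    SeqCLR implementation.
    """
    def cos(u, v):
        # Cosine similarity between two vectors.
        num = sum(x * y for x, y in zip(u, v))
        den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return num / den

    n = len(z_a)
    losses = []
    for i in range(n):
        pos = math.exp(cos(z_a[i], z_b[i]) / temperature)
        # Denominator sums over the positive and all negatives in z_b.
        denom = sum(math.exp(cos(z_a[i], z_b[j]) / temperature) for j in range(n))
        losses.append(-math.log(pos / denom))
    return sum(losses) / n
```

Minimizing this quantity drives matched pairs toward high similarity relative to all mismatched pairs in the batch.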
The first attention step re-weights the visual features from a CNN backbone using contextual features computed by a BiLSTM layer.
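To make the re-weighting step concrete, here is a minimal dot-product attention sketch in pure Python. It is an assumption-laden illustration, not the paper's architecture: the function `attention_reweight` and the list-of-vectors inputs (standing in for CNN feature columns and BiLSTM outputs) are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_reweight(visual_feats, context_feats):
    """Re-weight visual features by their similarity to contextual features.

    visual_feats: list of T feature vectors (stand-in for a CNN backbone)
    context_feats: list of T feature vectors (stand-in for a BiLSTM layer)
    Returns the visual features scaled by dot-product attention weights.
    """
    # Dot-product similarity between each visual/contextual pair.
    scores = [sum(v * c for v, c in zip(vf, cf))
              for vf, cf in zip(visual_feats, context_feats)]
    weights = softmax(scores)
    # Scale each visual feature vector by its attention weight.
    return [[w * x for x in vf] for w, vf in zip(weights, visual_feats)]
```

The weights sum to one across the sequence, so positions whose visual features agree with the learned context receive proportionally more mass.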
Rather than using a hand-designed state representation, we use one that is learned directly from the data by a DQN agent.
Instability and variability of Deep Reinforcement Learning (DRL) algorithms tend to adversely affect their performance.