Irregular scene text recognition has attracted much attention from the research community, mainly due to the complex shapes of text in natural scenes.
Extensive experiments on standard benchmarks demonstrate that our end-to-end model achieves a new state of the art for both regular and irregular scene text recognition, while requiring about one-sixth the inference time of attention-based methods.
Scene Text Recognition is a challenging problem because of irregular styles and various distortions.
Driven by deep learning and the large volume of data, scene text recognition has evolved rapidly in recent years.
Convolutional Recurrent Neural Networks (CRNNs) excel at scene text recognition.
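CRNNs typically couple a convolutional feature extractor and recurrent layers with a CTC output layer. As a minimal sketch of the decoding step only (the alphabet and the blank index here are illustrative assumptions, not taken from any specific paper), greedy CTC decoding collapses repeated per-frame predictions and removes blanks:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """CTC best-path decoding: collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for s in frame_ids:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

# Per-frame argmax IDs over an assumed alphabet {0: blank, 1: 'c', 2: 'a', 3: 't'}
frames = [1, 1, 0, 2, 2, 0, 0, 3]
alphabet = {1: "c", 2: "a", 3: "t"}
print("".join(alphabet[i] for i in ctc_greedy_decode(frames)))  # -> cat
```

The blank symbol is what lets CTC emit the same character twice in a row: `[1, 0, 1]` decodes to two occurrences of symbol 1, whereas `[1, 1]` decodes to one.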
In this paper, we study a text recognition framework that accounts for long-term temporal dependencies in the encoder stage.
Attention-based scene text recognizers have achieved great success by leveraging more compact intermediate representations to learn 1D or 2D attention within an RNN-based encoder-decoder architecture.
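The core of such a decoder is an attention step: at each output character, the decoder state scores every encoder time step, the scores are normalized with a softmax, and the weighted sum of encoder states forms a context vector. A minimal sketch of 1D dot-product attention (the shapes and the dot-product scoring function are illustrative assumptions; many recognizers use learned additive scoring instead):

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """1D dot-product attention.
    decoder_state: (d,) current decoder hidden state
    encoder_states: (T, d) encoder outputs, one row per time step
    Returns the (d,) context vector and the (T,) attention weights."""
    scores = encoder_states @ decoder_state       # (T,) alignment scores
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_states            # (d,) weighted sum of states
    return context, weights
```

The context vector is then fed, together with the previous character embedding, into the RNN cell that predicts the next character; 2D variants score a grid of encoder features instead of a sequence.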
While each claims to have pushed the boundary of the technology, a holistic and fair comparison has been largely missing from the field due to inconsistent choices of training and evaluation datasets.
Reading text in the wild is a very challenging task due to the diversity of text instances and the complexity of natural scenes.
Nonetheless, most previous methods may not work well on low-resolution text, which is common in natural scene images.