Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription.
In this paper, we introduce MAE to pathological image analysis for the first time.
MDCDP uses positional embedding to query both visual and semantic features following the attention mechanism.
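The querying step described above can be illustrated with a minimal sketch of scaled dot-product attention, in which a positional embedding serves as the query over a set of feature vectors. This is not the MDCDP implementation: the helper names (`softmax`, `attend`), the toy vectors, and the choice to use the features as both keys and values are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Scaled dot-product attention of one query over feature vectors.

    Assumption: the feature vectors act as both keys and values.
    Returns the attended context vector and the attention weights.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, feat)) / math.sqrt(d)
        for feat in features
    ]
    weights = softmax(scores)
    context = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(d)
    ]
    return context, weights


# Hypothetical toy data: one positional embedding per decoding step.
pos_emb = [[1.0, 0.0], [0.0, 1.0]]
visual = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
semantic = [[0.5, 0.1], [0.1, 0.5]]

results = []
for query in pos_emb:
    visual_ctx, visual_w = attend(query, visual)        # query visual features
    semantic_ctx, semantic_w = attend(query, semantic)  # query semantic features
    results.append((visual_ctx, visual_w, semantic_ctx, semantic_w))
```

Each decoding position ends up with one context vector per modality. How the two are fused afterwards is not specified in the sentence above, so the sketch leaves them separate.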
Ranked #2 on Scene Text Recognition on IIIT5k
Nowadays, U-net-like FCNs dominate biomedical image segmentation applications and attain promising performance, largely due to their elegant architectures, e.g., symmetric contracting and expansive paths as well as lateral skip-connections.
Considering that scene images vary widely in text and background, we further design a modality-transform block that transforms 2D input images into 1D sequences and, combined with the encoder, extracts more discriminative features.
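As a rough illustration of what turning a 2D image into a 1D sequence can mean, the sketch below reads an image column by column, so that each sequence element is the vector of pixel values in one column. This is a generic map-to-sequence step and not the modality-transform block itself, whose internals are not described here; the function name `image_to_sequence` and the toy image are hypothetical.

```python
def image_to_sequence(image):
    """Convert an H x W image (list of rows) into a length-W sequence.

    Each sequence element is the column vector of length H at that
    horizontal position, so the sequence follows left-to-right reading order.
    """
    height = len(image)
    width = len(image[0])
    return [[image[row][col] for row in range(height)] for col in range(width)]


# Hypothetical 2 x 3 grayscale image.
toy_image = [
    [1, 2, 3],
    [4, 5, 6],
]
sequence = image_to_sequence(toy_image)
```

A learned block would typically replace the raw column read-out with convolutional or attention-based features, but the output shape (one vector per horizontal position) stays the same.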
Second, we further extend bMS to a more general form, namely contrastive binary mean shift (cbMS), which maximizes the contrastive density in binary space, for finding informative patterns that are both frequent and discriminative for the dataset.