Scene Text Detection
91 papers with code • 9 benchmarks • 15 datasets
Scene Text Detection is a computer vision task that involves automatically identifying and localizing text within natural images or videos. The goal of scene text detection is to develop algorithms that can robustly detect and label text with bounding boxes in uncontrolled and complex environments, such as street signs, billboards, or license plates.
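Detected boxes are typically matched against ground-truth annotations using intersection-over-union (IoU), the standard criterion in detection benchmarks. A minimal sketch, assuming an illustrative axis-aligned (x1, y1, x2, y2) box format (real benchmarks often use rotated or polygonal boxes):

```python
# Sketch: IoU between a predicted and a ground-truth text box.
# Box format (x1, y1, x2, y2) is an illustrative assumption; scene text
# benchmarks commonly use quadrilaterals or polygons instead.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A prediction is usually counted as a true positive when its IoU with a ground-truth box exceeds a threshold such as 0.5.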
Source: ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection
Libraries
Use these libraries to find Scene Text Detection models and implementations.
Latest papers
DocXChain: A Powerful Open-Source Toolchain for Document Parsing and Beyond
In this report, we introduce DocXChain, a powerful open-source toolchain for document parsing, which is designed and developed to automatically convert the rich information embodied in unstructured documents, such as text, tables and charts, into structured representations that are readable and manipulable by machines.
Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards Enhancing Text Spotting Performance
The adaptation capability to a wide range of domains is crucial for scene text spotting models when deployed to real-world conditions.
STEP -- Towards Structured Scene-Text Spotting
We introduce the structured scene-text spotting task, which requires a scene-text OCR system to spot text in the wild according to a query regular expression.
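The core idea can be illustrated with a small sketch: given the (text, box) outputs of any spotter, keep only the instances whose transcription matches a query regular expression. The detections below are mock data, not the paper's method or outputs:

```python
import re

# Sketch of regex-conditioned spotting: filter OCR outputs by a query pattern.
# The detections and the license-plate pattern are illustrative assumptions.

def filter_by_query(spotted, pattern):
    """Return (text, box) pairs whose text fully matches the regex pattern."""
    query = re.compile(pattern)
    return [(text, box) for text, box in spotted if query.fullmatch(text)]

detections = [
    ("AB-1234", (10, 20, 90, 45)),   # license-plate-like string
    ("EXIT",    (120, 5, 160, 30)),  # unrelated scene text
    ("CD-9876", (15, 60, 95, 85)),
]
plates = filter_by_query(detections, r"[A-Z]{2}-\d{4}")
# plates contains only the two plate-like detections
```

In the actual task the query conditions the spotter itself, rather than post-filtering its outputs as done here.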
MixNet: Toward Accurate Detection of Challenging Scene Text in the Wild
Detecting small scene text instances in the wild is particularly challenging, where the influence of irregular positions and non-ideal lighting often leads to detection errors.
Turning a CLIP Model into a Scene Text Spotter
Utilizing only 10% of the supervised data, FastTCM-CR50 improves performance by an average of 26.5% and 5.5% for text detection and spotting tasks, respectively.
SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression
In light of this, we constrain the incorporation of segmentation branches to the first few decoder layers and employ progressive regression refinement in subsequent layers, achieving performance gains while minimizing computational load from the mask. Furthermore, we propose a Mask-informed Query Enhancement module.
LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network
Next, we propose a dual assignment scheme for speed acceleration.
ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining
As ViTEraser implicitly integrates text localization and inpainting, we propose a novel end-to-end pretraining method, termed SegMIM, which focuses the encoder and decoder on the text box segmentation and masked image modeling tasks, respectively.
DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting
In this paper, we present DeepSolo++, a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously.
Turning a CLIP Model into a Scene Text Detector
Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection.