To solve this, we propose Softmax-aware Binarization, which dynamically adapts to the data distribution and reduces the error caused by binarization.
Pre-training and long training schedules are generally necessary to obtain a high-performing text detector based on deep networks.
Binary neural networks use the $Sign$ function to binarize real values, and its non-differentiability inevitably introduces large gradient errors during backpropagation.
Ranked #1 on Binarization on ImageNet (Top 1 Accuracy metric)
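To illustrate the gradient problem with $Sign$ described above, here is a minimal NumPy sketch of sign binarization paired with the commonly used straight-through estimator (STE). This is a generic baseline, not the method of the paper summarized here; the function names, the `clip` parameter, and the convention of mapping zero to +1 are assumptions made for illustration.

```python
import numpy as np

def sign_forward(x):
    """Binarize real values to {-1, +1}. Zero maps to +1 (an assumed convention)."""
    return np.where(x >= 0, 1.0, -1.0)

def sign_backward_ste(x, grad_out, clip=1.0):
    """Straight-through estimator: the true derivative of Sign is zero almost
    everywhere, so the incoming gradient is passed through unchanged where
    |x| <= clip and zeroed elsewhere. The mismatch between this surrogate and
    the true derivative is one source of the gradient error."""
    return grad_out * (np.abs(x) <= clip)

x = np.array([-1.5, -0.2, 0.0, 0.7, 2.0])
y = sign_forward(x)                        # binarized values
g = sign_backward_ste(x, np.ones_like(x))  # surrogate gradient w.r.t. x
```

The surrogate gradient is nonzero only inside the clipping window, so inputs far from zero receive no update signal under this approximation.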
In this paper, we present a simple yet effective data-free quantization method with accurate activation clipping and adaptive batch normalization.
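As a rough illustration of what activation clipping means in quantization, the sketch below applies uniform quantization to non-negative activations with a fixed clipping threshold. It is a generic example under stated assumptions: the name `quantize_activations`, the `clip_max` and `num_bits` parameters, and the restriction to post-ReLU activations are hypothetical, and the snippet above does not say how the paper selects its clipping value.

```python
import numpy as np

def quantize_activations(x, clip_max, num_bits=8):
    """Uniformly quantize activations in [0, clip_max] to 2**num_bits levels.

    Values outside the range are clipped first; clip_max is assumed to be
    given (how to choose it accurately is the hard part and is not shown).
    """
    levels = 2 ** num_bits - 1
    scale = clip_max / levels              # step size between adjacent levels
    x_clipped = np.clip(x, 0.0, clip_max)  # activation clipping
    q = np.round(x_clipped / scale)        # integer level index
    return q * scale                       # dequantized ("fake-quantized") value

acts = np.array([-1.0, 0.4, 2.9, 5.2, 10.0])
acts_q = quantize_activations(acts, clip_max=6.0, num_bits=2)
```

A smaller `clip_max` gives finer resolution inside the range at the cost of saturating more large activations, which is why the choice of threshold matters.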
Recent video text spotting methods usually require a three-stage pipeline, i.e., detecting text in individual images, recognizing the localized text, and tracking text streams with post-processing to generate the final results.
Semantic representation is of great benefit to the video text tracking (VTT) task, which requires simultaneously classifying, detecting, and tracking text in video.
Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data.
Abundant unpaired single-modal text data is additionally used to improve the generalization of the text branch.
For example, without using polygon annotations, PSENet achieves an 80.5% F-score on TotalText (vs. 80.9% for its fully supervised counterpart), 31.1% better than training directly with upright bounding box annotations, and saves more than 80% of labeling costs.
To address the severe domain distribution mismatch, we propose a synthetic-to-real domain adaptation method for scene text detection, which transfers knowledge from synthetic data (source domain) to real data (target domain).
Sentence matching is an essential task in QA systems and is usually reformulated as a Paraphrase Identification (PI) problem.
Ranked #13 on Paraphrase Identification on Quora Question Pairs (Accuracy metric)
In this paper, we propose a pixel-wise method named TextCohesion for scene text detection, which splits a text instance into five key components: a Text Skeleton and four Directional Pixel Regions.
Ranked #1 on Curved Text Detection on SCUT-CTW1500