ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.
74 PAPERS • NO BENCHMARKS YET
ICDAR2003 is a dataset for scene text recognition. It contains 507 natural scene images in total (258 training images and 249 test images). The images are annotated at the character level, so characters and words can be cropped from the images.
51 PAPERS • 1 BENCHMARK
TextOCR is a benchmark for text recognition on arbitrary-shaped scene text in natural images. It provides ~1M high-quality word annotations on TextVQA images, enabling end-to-end reasoning on downstream tasks such as visual question answering and image captioning.
21 PAPERS • NO BENCHMARKS YET
This dataset includes 4,500 fully annotated images (over 30,000 license plate characters) from 150 vehicles in real-world scenarios where both the vehicle and the camera (inside another vehicle) are moving.
11 PAPERS • 1 BENCHMARK
This dataset aims at evaluating the License Plate Character Segmentation (LPCS) problem. The experimental results of the paper Benchmark for License Plate Character Segmentation were obtained using this dataset, which comprises 101 on-track vehicles captured during the day. The video was recorded with a static camera in early 2015.
6 PAPERS • 1 BENCHMARK
This dataset, called RodoSol-ALPR dataset, contains 20,000 images captured by static cameras located at pay tolls owned by the Rodovia do Sol (RodoSol) concessionaire, which operates 67.5 kilometers of a highway (ES-060) in the Brazilian state of Espírito Santo.
5 PAPERS • NO BENCHMARKS YET
The ChineseLP dataset contains 411 vehicle images (mostly of passenger cars) with Chinese license plates (LPs). It consists of 252 images captured by the authors and 159 images downloaded from the internet. The images show great variation in resolution (from 143 × 107 to 2048 × 1536 pixels), illumination, and background.
4 PAPERS • 1 BENCHMARK
Twitter100k is a large-scale dataset for weakly supervised cross-media retrieval.
4 PAPERS • NO BENCHMARKS YET
Imgur5K is a large-scale in-the-wild handwriting dataset containing challenging real-world handwritten samples from nearly 5K writers. It consists of ~135K handwritten English words drawn from 5K different images. Unlike existing OCR datasets, which have limited variability in their images, the images in Imgur5K contain a diverse set of styles.
3 PAPERS • NO BENCHMARKS YET
This dataset spans five domains (synthetic, document, street view, handwritten, and car license) and contains over five million images.
2 PAPERS • 2 BENCHMARKS
This dataset contains 2,000 images taken from inside a warehouse of the Energy Company of Paraná (Copel), which directly serves more than 4 million consuming units in the Brazilian state of Paraná.
2 PAPERS • 1 BENCHMARK
This dataset contains 12,500 meter images acquired in the field by the employees of the Energy Company of Paraná (Copel), which directly serves more than 4 million consuming units, across 395 cities and 1,113 locations (i.e., districts, villages and settlements), located in the Brazilian state of Paraná.
1 PAPER • 1 BENCHMARK
Optical images of printed circuit boards, together with detailed annotations of all text, logos, and surface-mount devices (SMDs). The several hundred samples span a wide variety of manufacturing locations, sizes, node technologies, applications, and more.
1 PAPER • NO BENCHMARKS YET
The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.