The IAM database contains 13,353 images of handwritten lines of text created by 657 writers. The texts those writers transcribed are from the Lancaster-Oslo/Bergen Corpus of British English. It includes contributions from 657 writers making a total of 1,539 handwritten pages comprising of 115,320 words and is categorized as part of modern collection. The database is labeled at the sentence, line, and word levels.
165 PAPERS • 2 BENCHMARKS
Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.
141 PAPERS • 3 BENCHMARKS
ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.
72 PAPERS • NO BENCHMARKS YET
Contains 145k captions for 28k images. The dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.
63 PAPERS • 1 BENCHMARK
TextOCR is a dataset to benchmark text recognition on arbitrary shaped scene-text. TextOCR requires models to perform text-recognition on arbitrary shaped scene-text present on natural images. TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning.
21 PAPERS • NO BENCHMARKS YET
Twitter100k is a large-scale dataset for weakly supervised cross-media retrieval.
4 PAPERS • NO BENCHMARKS YET
MCSCSet is a large-scale specialist-annotated dataset, designed for the task of Medical-domain Chinese Spelling Correction that contains about 200k samples. MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, ii) corresponding misspelled sentences manually annotated by medical specialists.
2 PAPERS • NO BENCHMARKS YET
5 domains: synthetic domain, document domain, street view domain, handwritten domain, and car license domain over five million images
2 PAPERS • 2 BENCHMARKS
Optical images of printed circuit boards as well as detailed annotations of any text, logos, and surface-mount devices (SMDs). There are several hundred samples spanning a wide variety of manufacturing locations, sizes, node technology, applications, and more.
1 PAPER • NO BENCHMARKS YET
MatriVasha the largest dataset of handwritten Bangla compound characters for research on handwritten Bangla compound character recognition. The proposed dataset contains 120 different types of compound characters that consist of 306,464 images written where 152,950 male and 153,514 female handwritten Bangla compound characters. This dataset can be used for other issues such as gender, age, district base handwriting research because the sample was collected that included district authenticity, age group, and an equal number of men and women.
The UTRSet-Real dataset is a comprehensive, manually annotated dataset specifically curated for Printed Urdu OCR research. It contains over 11,000 printed text line images, each of which has been meticulously annotated. One of the standout features of this dataset is its remarkable diversity, which includes variations in fonts, text sizes, colours, orientations, lighting conditions, noises, styles, and backgrounds. This diversity closely mirrors real-world scenarios, making the dataset highly suitable for training and evaluating models that aim to excel in real-world Urdu text recognition tasks.
1 PAPER • 1 BENCHMARK
The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.