12 dataset results for Optical Character Recognition (OCR) AND Texts

IAM

The IAM database contains 13,353 images of handwritten lines of text created by 657 writers, who transcribed texts from the Lancaster-Oslo/Bergen Corpus of British English. In total it covers 1,539 handwritten pages comprising 115,320 words and is categorized as part of the modern collection. The database is labeled at the sentence, line, and word levels.

165 PAPERS • 2 BENCHMARKS

FUNSD (Form Understanding in Noisy Scanned Documents)

Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.

141 PAPERS • 3 BENCHMARKS

ST-VQA (Scene Text Visual Question Answering)

ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.

72 PAPERS • NO BENCHMARKS YET

TextCaps

TextCaps contains 145k captions for 28k images. The dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.

63 PAPERS • 1 BENCHMARK

TextOCR

TextOCR is a dataset for benchmarking text recognition on arbitrarily shaped scene text in natural images. It provides ~1M high-quality word annotations on TextVQA images, allowing end-to-end reasoning on downstream tasks such as visual question answering or image captioning.

21 PAPERS • NO BENCHMARKS YET

Twitter100k

Twitter100k is a large-scale dataset for weakly supervised cross-media retrieval.

4 PAPERS • NO BENCHMARKS YET

MCSCSet

MCSCSet is a large-scale, specialist-annotated dataset of about 200k samples designed for Medical-domain Chinese Spelling Correction. It comprises: i) extensive real-world medical queries collected from Tencent Yidian, and ii) the corresponding misspelled sentences manually annotated by medical specialists.

2 PAPERS • NO BENCHMARKS YET

MSDA (Multi-source domain adaptation dataset for text recognition)

Over five million images spanning 5 domains: synthetic, document, street view, handwritten, and car license.

2 PAPERS • 2 BENCHMARKS

FICS PCB Image Collection (FPIC)

Optical images of printed circuit boards as well as detailed annotations of any text, logos, and surface-mount devices (SMDs). There are several hundred samples spanning a wide variety of manufacturing locations, sizes, node technology, applications, and more.

1 PAPER • NO BENCHMARKS YET

MatriVasha (MatriVasha: Compound Character Dataset)

MatriVasha is the largest dataset of handwritten Bangla compound characters for research on handwritten Bangla compound character recognition. The proposed dataset contains 120 different types of compound characters across 306,464 images, of which 152,950 were written by male and 153,514 by female writers. The dataset can also support research on handwriting by gender, age, and district, because sample collection covered district authenticity, age groups, and an equal number of men and women.

1 PAPER • NO BENCHMARKS YET

UTRSet-Real

The UTRSet-Real dataset is a manually annotated dataset curated for printed Urdu OCR research. It contains over 11,000 annotated printed text line images. The dataset is notably diverse, with variations in fonts, text sizes, colours, orientations, lighting conditions, noises, styles, and backgrounds. This diversity mirrors real-world scenarios, making the dataset suitable for training and evaluating models for real-world Urdu text recognition.

1 PAPER • 1 BENCHMARK

UTRSet-Synth

The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.

1 PAPER • NO BENCHMARKS YET
