The IAM database contains 13,353 images of handwritten lines of text created by 657 writers. The texts those writers transcribed are from the Lancaster-Oslo/Bergen Corpus of British English. It includes contributions from 657 writers making a total of 1,539 handwritten pages comprising of 115,320 words and is categorized as part of modern collection. The database is labeled at the sentence, line, and word levels.
183 PAPERS • 2 BENCHMARKS
This dataset arises from the READ project (Horizon 2020).
6 PAPERS • 1 BENCHMARK
5 PAPERS • 1 BENCHMARK
Bentham manuscripts refers to a large set of documents that were written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). Volunteers of the Transcribe Bentham initiative transcribed this collection. Currently, >6 000 documents or > 25 000 pages have been transcribed using this public web platform. For our experiments, we used the BenthamR0 dataset a part of the Bentham manuscripts.
4 PAPERS • 1 BENCHMARK
The database is written in Cyrillic and shares the same 33 characters. Besides these characters, the Kazakh alphabet also contains 9 additional specific characters. This dataset is a collection of forms. The sources of all the forms in the datasets were generated by LATEX which subsequently was filled out by persons with their handwriting. The database consists of more than 1400 filled forms. There are approximately 63000 sentences, more than 715699 symbols produced by approximately 200 diferent writers. We utilized three different datasets described as following:
Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges, when dealing with historical manuscripts, are due to the preservation of the paper support, the variability of the handwriting – even of the same author over a wide time-span – and the scarcity of data from ancient, poorly represented languages. With the aim of fostering the research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic splitting and a date-based splitting which takes into account the age of the author. The first setting is intended to study HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same writer in time periods for which training data are not available. For both co
Digital Peter is a dataset of Peter the Great's manuscripts annotated for segmentation and text recognition. The dataset may be useful for researchers to train handwriting text recognition models as a benchmark for comparing different models. It consists of 9,694 images and text files corresponding to lines in historical documents. The dataset includes Peter’s handwritten materials covering the period from 1709 to 1713.
3 PAPERS • 1 BENCHMARK
We introduce a new Dataset (BN-HTRd) for offline Handwritten Text Recognition (HTR) from images of Bangla scripts comprising words, lines, and document-level annotations. The BN-HTRd dataset is based on the BBC Bangla News corpus - which acted as ground truth texts for the handwritings. Our dataset contains a total of 786 full-page images collected from 150 different writers. With a staggering 1,08,181 instances of handwritten words, distributed over 14,383 lines and 23,115 unique words, this is currently the 'largest and most comprehensive dataset' in this field. We also provided the bounding box annotations (YOLO format) for the segmentation of words/lines and the ground truth annotations for full-text, along with the segmented images and their positions. The contents of our dataset came from a diverse news category, and annotators of different ages, genders, and backgrounds, having variability in writing styles. The BN-HTRd dataset can be adopted as a basis for various handwriting c
2 PAPERS • 2 BENCHMARKS
Description We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents.
Saint Gall dataset contains handwritten historical manuscripts written in Latin that date back to the 9th century. It consists of 60 pages, 1 410 text lines and 11 597 words.
2 PAPERS • 1 BENCHMARK
The Belfort dataset This dataset includes minutes of Belfort municipal council drawn up between 1790 and 1946. Documents include deliberations, lists of councillors, convocations, and agendas. It includes 24,105 text-line images that were automatically detected from pages. Up to 4 transcriptions are available for each line image: two from humans, and two from automatic models.
1 PAPER • 1 BENCHMARK
MatriVasha the largest dataset of handwritten Bangla compound characters for research on handwritten Bangla compound character recognition. The proposed dataset contains 120 different types of compound characters that consist of 306,464 images written where 152,950 male and 153,514 female handwritten Bangla compound characters. This dataset can be used for other issues such as gender, age, district base handwriting research because the sample was collected that included district authenticity, age group, and an equal number of men and women.
1 PAPER • NO BENCHMARKS YET