20 dataset results for Handwriting Recognition

The IAM database contains 13,353 images of handwritten lines of text created by 657 writers, who transcribed texts from the Lancaster-Oslo/Bergen Corpus of British English. In total it comprises 1,539 handwritten pages and 115,320 words, and it is categorized as part of the modern collection. The database is labeled at the sentence, line, and word levels.

168 PAPERS • 2 BENCHMARKS

DeepWriting

A new dataset of handwritten text with fine-grained annotations at the character level. The authors also report results from an initial user evaluation.

9 PAPERS • NO BENCHMARKS YET

RIMES (Reconnaissance & Indexation de données Manuscrites et de fac similÉS / Recognition & Indexing of handwritten documents & faxes)

The RIMES database (Reconnaissance et Indexation de données Manuscrites et de fac similÉS / Recognition and Indexing of handwritten documents and faxes) was created to evaluate automatic systems for the recognition and indexing of handwritten letters. Of particular interest are letters such as those sent by postal mail or fax by individuals to companies or administrations.

7 PAPERS • NO BENCHMARKS YET

Bentham (Bentham project)

Bentham manuscripts refers to a large set of documents written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). Volunteers of the Transcribe Bentham initiative transcribed this collection. Currently, more than 6,000 documents (over 25,000 pages) have been transcribed using this public web platform. For our experiments, we used the BenthamR0 dataset, a part of the Bentham manuscripts.

4 PAPERS • 1 BENCHMARK

HKR (Handwritten Kazakh and Russian (HKR) Database for Text Recognition)

The database is written in Cyrillic, and Kazakh and Russian share the same 33 characters. Besides these characters, the Kazakh alphabet contains 9 additional specific characters. The dataset is a collection of forms. The source of every form was generated with LaTeX and then filled out by people in their own handwriting. The database consists of more than 1,400 filled forms, with approximately 63,000 sentences and more than 715,699 symbols produced by approximately 200 different writers. We utilized three different datasets, described as follows:

4 PAPERS • 1 BENCHMARK

READ 2016 (HTR Dataset ICFHR 2016)

This dataset arises from the READ project (Horizon 2020).

4 PAPERS • 1 BENCHMARK

BanglaLekha-Isolated

This dataset contains Bangla handwritten numerals, basic characters, and compound characters. It was collected from multiple geographical locations within Bangladesh and includes samples from a variety of age groups. It can also be used for other classification problems, e.g. gender, age, or district.

3 PAPERS • 2 BENCHMARKS

KOHTD (Kazakh Offline Handwritten Text Dataset)

The Kazakh Offline Handwritten Text Dataset (KOHTD) has 3,000 handwritten exam papers, more than 140,335 segmented images, and approximately 922,010 symbols. It can serve researchers working on handwriting recognition with deep learning and machine learning.

3 PAPERS • 1 BENCHMARK

Konzil (Konzilsprotokolle_C)

The Konzil dataset was created by specialists at the University of Greifswald. It contains manuscripts written in modern German. The training set consists of 353 lines, the validation set of 29 lines, and the test set of 87 lines.

3 PAPERS • NO BENCHMARKS YET

Patzig

Patzig contains handwritten texts written in modern German. The training set consists of 485 lines, the validation set of 38 lines, and the test set of 118 lines.

3 PAPERS • NO BENCHMARKS YET

Ricordi

Ricordi contains handwritten texts written in Italian. The training set consists of 295 lines, the validation set of 19 lines, and the test set of 69 lines.

3 PAPERS • NO BENCHMARKS YET

Schiller (Shiller)

Schiller contains handwritten texts written in modern German. The training set consists of 244 lines, the validation set of 21 lines, and the test set of 63 lines.

3 PAPERS • NO BENCHMARKS YET

Schwerin

Schwerin contains handwritten texts written in medieval German. The training set consists of 793 lines, the validation set of 68 lines, and the test set of 196 lines.

3 PAPERS • NO BENCHMARKS YET

BN-HTRd (BN-HTRd: A Benchmark Dataset for Document Level Offline Bangla Handwritten Text Recognition (HTR))

We introduce a new dataset (BN-HTRd) for offline Handwritten Text Recognition (HTR) from images of Bangla script, comprising word-, line-, and document-level annotations. The BN-HTRd dataset is based on the BBC Bangla News corpus, which served as ground-truth text for the handwriting. Our dataset contains a total of 786 full-page images collected from 150 different writers. With 108,181 instances of handwritten words, distributed over 14,383 lines and 23,115 unique words, this is currently the 'largest and most comprehensive dataset' in this field. We also provide bounding box annotations (YOLO format) for the segmentation of words and lines, and ground-truth annotations for the full text, along with the segmented images and their positions. The contents of our dataset come from diverse news categories, and the annotators are of different ages, genders, and backgrounds, with varied writing styles. The BN-HTRd dataset can be adopted as a basis for various handwriting c

2 PAPERS • 2 BENCHMARKS

BRUSH (Brown University Stylus Handwriting)

The BRUSH dataset (BRown University Stylus Handwriting) contains 27,649 online handwriting samples from a total of 170 writers. Every sequence is labeled with its intended characters, so dataset users can identify which character a point in a sequence corresponds to. The dataset was introduced in the paper "Generating Handwriting via Decoupled Style Descriptors" by Atsunobu Kotani, Stefanie Tellex, and James Tompkin from Brown University, presented at the European Conference on Computer Vision (ECCV) 2020.

2 PAPERS • NO BENCHMARKS YET

Saint Gall

The Saint Gall dataset contains handwritten historical manuscripts written in Latin that date back to the 9th century. It consists of 60 pages, 1,410 text lines, and 11,597 words.

2 PAPERS • 1 BENCHMARK

An extensive dataset of handwritten central Kurdish isolated characters (Rebin M. Ahmed)

Data collection: The first step in building a database is finding a suitable source of data. Here, the main goal is to collect images of Kurdish handwritten characters written by many writers, so a form was designed for this purpose. The form is shown in Figure 1. Each form has one letter of the alphabet printed in the top right corner and 125 empty blocks. The writers were asked to write each letter three times in the three empty blocks. The total number of writers is 390. The forms were distributed among two main categories: the academic staff of the Information Technology department at Tishk International University, and the university students of the University of Kurdistan-Hawler, Salahaddin University, and Tishk International University, as shown in Table 2. In total there were ten sets of forms, each set with 35 forms for 35 different letters. At first, we decided that nine sets

1 PAPER • 1 BENCHMARK

Calliar

Calliar is a dataset for Arabic calligraphy. It consists of 2,500 JSON files that contain manually annotated strokes for Arabic calligraphy.

1 PAPER • NO BENCHMARKS YET

MatriVasha (MatriVasha: Compound Character Dataset)

MatriVasha is the largest dataset of handwritten Bangla compound characters for research on handwritten Bangla compound character recognition. The proposed dataset contains 120 different types of compound characters in 306,464 images, of which 152,950 were written by male and 153,514 by female writers. The dataset can also be used for other handwriting research, such as studies based on gender, age, or district, because the sample was collected to include district authenticity, age group, and an equal number of men and women.

1 PAPER • NO BENCHMARKS YET

DigiLeTs (Digit- and Letter Trajectories)

A dataset with 23,870 digital trajectories (i.e. time series) of handwritten lower- and uppercase Latin letters and Arabic numerals (a-z, A-Z, 0-9), generated by 77 experts using a Wacom Pen Tablet. An expert is considered a proficient user of the recorded symbols, in this case adult native German speakers.

0 PAPER • NO BENCHMARKS YET
