DIVA-HisDB: A Precisely Annotated Large Dataset of Challenging Medieval Manuscripts

International Conference on Frontiers in Handwriting Recognition 2016 · Fotini Simistira, Mathias Seuret, Nicole Eichenberger, Angelika Garz, Marcus Liwicki, Rolf Ingold

This paper introduces DIVA-HisDB, a publicly available historical manuscript database for the evaluation of several Document Image Analysis (DIA) tasks. The database consists of 150 annotated pages from three different medieval manuscripts with challenging layouts. Furthermore, we provide a layout analysis ground truth that has been iterated on, reviewed, and refined by an expert in medieval studies. DIVA-HisDB and the ground truth can be used for training and evaluating DIA tasks such as layout analysis, text line segmentation, binarization, and writer identification. Layout analysis results of several representative baseline technologies are also presented to help researchers evaluate their methods and advance the frontiers of complex historical manuscript analysis. An optimized state-of-the-art Convolutional AutoEncoder (CAE) achieves around 95% accuracy, which shows that there is still much room for improvement on these challenging layouts. Finally, we show that existing text line segmentation methods fail due to interlinear and marginal text elements.
