Transfer Learning for a Letter-Ngrams to Word Decoder in the Context of Historical Handwriting Recognition with Scarce Resources

COLING 2018 · Adeline Granet, Emmanuel Morin, Harold Mouchère, Solen Quiniou, Christian Viard-Gaudin

Lack of data can be an issue when beginning a new study on historical handwritten documents. To deal with this, we present the character-based decoder part of a multilingual approach based on transductive transfer learning for a historical handwriting recognition task on Italian Comedy Registers. The decoder must build a sequence of characters that corresponds to a word from a vector of letter-ngrams. As learning data, we created a new dataset from untapped resources that covers the same domain and period as our Italian Comedy data, as well as resources from common domains, periods, or languages. We obtain a 97.42% Character Recognition Rate and an 86.57% Word Recognition Rate on our Italian Comedy data, despite a lexical coverage of 67% between the Italian Comedy data and the training data. These results show that an efficient system can be obtained by carefully selecting the datasets used for the transfer learning.
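
To make the quantities named in the abstract concrete, the following Python sketch illustrates what a letter-ngram count vector for a word might look like, and how lexical coverage, Character Recognition Rate (CRR), and Word Recognition Rate (WRR) could be computed. The abstract does not define these, so everything here is an assumption made for illustration: the function names are hypothetical, n-grams are taken without boundary markers, coverage is measured over distinct word types, CRR is taken as one minus the total edit distance divided by the total reference length, and WRR as the fraction of exact word matches. It is not the authors' implementation.

```python
from collections import Counter


def letter_ngrams(word, n_max=3):
    """Count the letter n-grams of a word for n = 1..n_max (assumed encoding)."""
    grams = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams


def lexical_coverage(target_words, training_words):
    """Fraction of distinct target words also present in the training vocabulary."""
    target = set(target_words)
    return len(target & set(training_words)) / len(target)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def recognition_rates(references, hypotheses):
    """Return (CRR, WRR) under the assumed definitions described above."""
    total_chars = sum(len(r) for r in references)
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    crr = 1 - errors / total_chars
    wrr = sum(r == h for r, h in zip(references, hypotheses)) / len(references)
    return crr, wrr
```

Under these assumptions, `letter_ngrams("amor", 2)` yields counts for `a`, `m`, `o`, `r`, `am`, `mo`, and `or`; a decoder in the sense of the abstract would take such a vector as input and emit the character sequence of the word. A coverage of 67% means that roughly a third of the distinct target words never appear in the training vocabulary, which is why a character-level decoder is of interest here.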