Correcting Whitespace Errors in Digitized Historical Texts
Whitespace errors are common in digitized archives. This paper describes a lightweight unsupervised technique for recovering the original whitespace. Our approach is based on count statistics from Google n-grams, which are converted into a likelihood ratio test computed from interpolated trigram and bigram probabilities. To evaluate this approach, we annotate a small corpus of whitespace errors in a digitized collection of newspapers from the nineteenth-century United States. Our technique identifies and corrects most whitespace errors while introducing minimal oversegmentation: it achieves 77% recall at a false positive rate of less than 1%, and 91% recall at a false positive rate of less than 3%.
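The likelihood ratio test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the counts are toy stand-ins for Google n-gram statistics, and the interpolation weights, smoothing floor, and decision threshold are all assumed values chosen for demonstration.

```python
import math

# Toy counts standing in for Google n-gram statistics (illustrative only).
unigrams = {"went": 50, "to": 400, "the": 500, "store": 30, "tothe": 1}
bigrams = {("went", "to"): 20, ("to", "the"): 120, ("the", "store"): 15}
trigrams = {("went", "to", "the"): 10, ("to", "the", "store"): 8}

TOTAL = sum(unigrams.values())

def p_interp(w3, w1=None, w2=None, lambdas=(0.6, 0.3, 0.1)):
    """Linearly interpolate trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas
    p1 = unigrams.get(w3, 0) / TOTAL
    p2 = (bigrams.get((w2, w3), 0) / max(unigrams.get(w2, 0), 1)
          if w2 else 0.0)
    p3 = (trigrams.get((w1, w2, w3), 0) / max(bigrams.get((w1, w2), 0), 1)
          if w1 and w2 else 0.0)
    return l3 * p3 + l2 * p2 + l1 * p1 + 1e-12  # smoothing floor

def log_prob(tokens):
    """Sum interpolated log-probabilities over a token sequence."""
    total = 0.0
    for i, w in enumerate(tokens):
        w1 = tokens[i - 2] if i >= 2 else None
        w2 = tokens[i - 1] if i >= 1 else None
        total += math.log(p_interp(w, w1, w2))
    return total

def should_split(context, merged, left, right, threshold=0.0):
    """Likelihood ratio test: split the merged token if the split
    hypothesis is more probable than the merged one by the threshold."""
    ratio = log_prob(context + [left, right]) - log_prob(context + [merged])
    return ratio > threshold

# Does "went tothe" contain a missing space, i.e. should "tothe"
# become "to the"?
print(should_split(["went"], "tothe", "to", "the"))
```

Because the log-probability of the shared context cancels in the ratio, only the terms involving the candidate token and its neighbors matter; the threshold trades recall against the false positive (oversegmentation) rate reported in the abstract.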