This article focuses on the problem of identifying articles and recovering their text within and across newspaper pages when OCR delivers only one text file per page.
We ask how to practically build a model for German named entity recognition (NER) that performs at the state of the art for both contemporary and historical texts, i.e., a big-data and a small-data scenario.
Detecting the similarity between job advertisements is important for job recommendation systems as it allows, for example, the application of item-to-item based recommendations.
Complex word identification (CWI) is an important task in text accessibility.
Complex Word Identification (CWI) is an important task in lexical simplification and text accessibility.
In this paper, we propose an unsupervised method to identify noun sense changes based on rigorous analysis of time-varying text data available in the form of millions of digitized books.
This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams, which are extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008.