Monolingual corpus creation and evaluation of truly low-resource languages from Peru
We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, extracting and processing text from PDF files is a relevant approach in a truly low-resource language scenario. Our corpus creation procedure applies multiple filtering steps and focuses on educational PDF documents. Through an evaluation based on language modelling and character-level perplexity, we determine that our method allows the creation of clean monolingual corpora to support further Natural Language Processing (NLP) tasks in the four languages.
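To make the evaluation metric concrete, the following is a minimal sketch of character-level perplexity, assuming a character bigram model with add-one smoothing. The abstract does not specify the language model used, so the model choice, the function names `train_char_bigram` and `char_perplexity`, and the `UNK` placeholder are all illustrative assumptions rather than the authors' setup. Lower perplexity on held-out text suggests the text resembles the training corpus more closely.

```python
import math
from collections import Counter

UNK = "\0"  # hypothetical placeholder shared by all characters unseen in training


def train_char_bigram(text):
    """Collect bigram and context counts from the training text (assumed model)."""
    if len(text) < 2:
        raise ValueError("training text needs at least two characters")
    vocab = set(text)
    context_counts = Counter(text[:-1])            # how often each char starts a bigram
    bigram_counts = Counter(zip(text, text[1:]))   # how often each char pair occurs
    return context_counts, bigram_counts, vocab


def char_perplexity(model, text):
    """Character-level perplexity: exp of the mean negative log-probability."""
    context_counts, bigram_counts, vocab = model
    if len(text) < 2:
        raise ValueError("evaluation text needs at least two characters")
    # Map unseen characters to a single UNK symbol so probabilities stay normalized.
    chars = [c if c in vocab else UNK for c in text]
    v = len(vocab) + 1  # vocabulary size including UNK
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        # Add-one (Laplace) smoothed estimate of P(cur | prev).
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))


if __name__ == "__main__":
    model = train_char_bigram("abababababab")
    print(char_perplexity(model, "ababab"))  # text similar to the training data
    print(char_perplexity(model, "aabbxx"))  # noisier text, expected to score higher
```

In a corpus-cleaning setting, a score like this could be compared before and after filtering, on the assumption that noisy extraction output raises perplexity.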