KoKo: an L1 Learner Corpus for German

LREC 2014 · Andrea Abel, Aivars Glaznieks, Lionel Nicolas, Egon Stemle

We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation. The corpus contains both texts and corresponding survey information from 1,319 pupils and amounts to around 716,000 tokens. The evaluation of the transcriptions and annotations shows an accuracy of approximately 80% for orthographic error annotations, as well as high accuracies for transcription (>99%), automatic tokenisation (>99%), sentence splitting (>96%) and POS-tagging (>94%). The KoKo corpus will be published at the end of 2014. It will be the first accessible linguistically annotated German L1 learner corpus and a valuable source for research on L1 learner language as well as for teachers of German as L1, in particular with regard to writing skills.
