SYN2015: Representative Corpus of Contemporary Written Czech

LREC 2016 · Michal Křen, Václav Cvrček, Tomáš Čapka, Anna Čermáková, Milena Hnátková, Lucie Chlumská, Tomáš Jelínek, Dominika Kováříková, Vladimír Petkevič, Pavel Procházka, Hana Skoumalová, Michal Škrabal, Petr Truneček, Pavel Vondřička, Adrian Jan Zasina

This paper describes the design, composition and annotation of SYN2015, a new 100-million-word representative corpus of contemporary written Czech. SYN2015 continues the series of representative SYN corpora, which can be described as traditional (as opposed to web-crawled corpora): they feature cleared copyright, well-defined composition, reliable annotation and high-quality text processing. At the same time, SYN2015 is designed to reflect the variety of written Czech text production, and it introduces necessary methodological and technological enhancements, including detailed bibliographic annotation and text classification based on an updated scheme. The corpus was produced with a completely rebuilt text-processing toolchain called SynKorp. SYN2015 is lemmatized and morphologically and syntactically annotated with state-of-the-art tools. It has been published within the framework of the Czech National Corpus and is available via the standard corpus query interface KonText at http://kontext.korpus.cz and as a dataset in shuffled format.