The siParl corpus of Slovene parliamentary proceedings
The paper describes the acquisition, up-translation, encoding, annotation, and distribution of siParl, a collection of parliamentary debates from the Assembly of the Republic of Slovenia from 1990–2018, covering the period from just before Slovenia became an independent country in 1991 almost up to the present. The entire corpus, comprising over 8 thousand sessions, 1 million speeches, and 200 million words, was uniformly encoded in accordance with the TEI-based Parla-CLARIN schema for encoding corpora of parliamentary debates. It contains extensive metadata (about the speakers, a typology of sessions, etc.) as well as structural and editorial annotations. The corpus was also part-of-speech tagged and lemmatised using state-of-the-art tools. It is maintained on GitHub, with its major versions archived in the CLARIN.SI repository, and is available for linguistic analysis through the on-line CLARIN.SI concordancers, thus offering an invaluable resource for scholars studying Slovenian political history.