The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language. It contains 82,000 sentences, 1.2 million words and 250,000 punctuation marks. Texts were selected from six different domains, with roughly 200,000 words taken from each. The domains are the following:
- fiction
- compositions of pupils between 14-16 years of age
- newspaper articles (from the newspapers Népszabadság, Népszava, Magyar Hírlap, HVG)
- texts in informatics
- legal texts
- business and financial news

The treebank exists in three versions:
- Szeged Treebank 1.0 is annotated for noun phrases and clauses;
- Szeged Treebank 2.0 contains a deep phrase-structured syntactic analysis for all sentences;
- Szeged Dependency Treebank contains dependency-style annotation of all sentences.

A morphologically reannotated version of the corpus, Szeged Corpus 2.5, has just been released. In it, participles and causative, frequentative and modal verbs are distinctively marked, and unknown or misspelled words have been corrected, along with some minor morphological modifications. If you are interested in Szeged Corpus 2.5, please contact Veronika Vincze.