The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language. It contains 82,000 sentences, 1.2 million words, and 250,000 punctuation marks. Texts were selected from six different domains, with roughly 200,000 words taken from each. The domains are the following:
4 PAPERS • NO BENCHMARKS YET
Automatically generated segmentation, tokenization, and morphological and syntactic annotations of raw texts in 45 languages, produced by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with 100-dimensional word embeddings computed from the lowercased texts with word2vec (https://code.google.com/archive/p/word2vec/).
1 PAPER • NO BENCHMARKS YET