Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with word embeddings of dimension 100 computed from lowercased texts by word2vec (https://code.google.com/archive/p/word2vec/).
1 PAPER • NO BENCHMARKS YET
The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language. It contains 82,000 sentences, 1.2 million words and 250,000 punctuation marks. Texts were selected from six different domains, ~200,000 words in size from each. The domains are the following:
4 PAPERS • NO BENCHMARKS YET
The CoNLL dataset is a widely used resource in the field of natural language processing (NLP). The term “CoNLL” stands for Conference on Natural Language Learning. It originates from a series of shared tasks organized at the Conferences of Natural Language Learning.
177 PAPERS • 49 BENCHMARKS
iLur News Texts is a dataset of over 12000 news articles from iLur.am, categorized into 7 classes: sport, politics, weather, economy, accidents, art, society. The articles are split into train (2242k tokens) and test sets (425k tokens).