no code implementations • WS 2018 • Nikola Ljubešić, Tomaž Erjavec, Darja Fišer
Both datasets are published in encrypted form, to enable others to perform experiments on detecting content to be deleted without revealing potentially inappropriate content.
no code implementations • WS 2017 • Kaja Dobrovoljc, Tomaž Erjavec, Simon Krek
We overview the existing dependency treebanks for Slovenian and then detail the conversion of the ssj200k treebank to the framework of Universal Dependencies version 2.
no code implementations • WS 2017 • Nikola Ljubešić, Tomaž Erjavec, Darja Fišer
We remove more than half of the error of the standard tagger when applied to non-standard texts by training it on a combination of standard and non-standard training data, while enriching the data representation with external resources removes an additional 11 percent of the error.
no code implementations • WS 2017 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec
In this paper we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users' tweets.
no code implementations • WS 2017 • Darja Fišer, Tomaž Erjavec, Nikola Ljubešić
In this paper we present the legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia.
no code implementations • LREC 2014 • Darja Fišer, Aleš Tavčar, Tomaž Erjavec
The paper presents sloWCrowd, a simple tool developed to facilitate crowdsourcing lexicographic tasks, such as error correction in automatically generated wordnets and semantic annotation of corpora.
no code implementations • LREC 2012 • Tomaž Erjavec
The paper presents a gold-standard reference corpus of historical Slovene containing 1,000 sampled pages from over 80 texts, which were, for the most part, written between 1750 and 1900.
no code implementations • LREC 2016 • Nikola Ljubešić, Tomaž Erjavec, Darja Fišer
In computer-mediated communication, users of Latin-based scripts often omit diacritics when writing.
no code implementations • LREC 2020 • Andrej Pancur, Tomaž Erjavec
The paper describes the process of acquisition, up-translation, encoding, annotation, and distribution of siParl, a collection of the parliamentary debates from the Assembly of the Republic of Slovenia from 1990 to 2018, covering the period from just before Slovenia became an independent country in 1991, and almost up to the present.
no code implementations • LREC 2020 • Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraz Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem, Kaja Dobrovoljc
We describe a new version of the Gigafida reference corpus of Slovene.
1 code implementation • LREC 2014 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec
This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages.
1 code implementation • LREC 2016 • Nikola Ljubešić, Tomaž Erjavec
In this paper we present a tagger developed for inflectionally rich languages for which both a training corpus and a lexicon are available.