Search Results for author: Tomaž Erjavec

Found 17 papers, 2 papers with code

The siParl corpus of Slovene parliamentary proceedings

no code implementations LREC 2020 Andrej Pancur, Tomaž Erjavec

The paper describes the process of acquisition, up-translation, encoding, annotation, and distribution of siParl, a collection of the parliamentary debates from the Assembly of the Republic of Slovenia from 1990–2018, covering the period from just before Slovenia became an independent country in 1991, and almost up to the present.

Translation

Datasets of Slovene and Croatian Moderated News Comments

no code implementations WS 2018 Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

Both datasets are published in encrypted form, to enable others to perform experiments on detecting content to be deleted without revealing potentially inappropriate content.

General Classification

Language-independent Gender Prediction on Twitter

no code implementations WS 2017 Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

In this paper we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users' tweets.

Gender Prediction General Classification

Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene

no code implementations WS 2017 Darja Fišer, Tomaž Erjavec, Nikola Ljubešić

In this paper we present the legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia.

General Classification

The Universal Dependencies Treebank for Slovenian

no code implementations WS 2017 Kaja Dobrovoljc, Tomaž Erjavec, Simon Krek

We overview the existing dependency treebanks for Slovenian and then detail the conversion of the ssj200k treebank to the framework of Universal Dependencies version 2.

Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text

no code implementations WS 2017 Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

We remove more than half of the error of the standard tagger when applied to non-standard texts by training it on a combination of standard and non-standard training data, while enriching the data representation with external resources removes an additional 11 percent of the error.

Domain Adaptation Lemmatization +2

Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene

1 code implementation LREC 2016 Nikola Ljubešić, Tomaž Erjavec

In this paper we present a tagger developed for inflectionally rich languages for which both a training corpus and a lexicon are available.

sloWCrowd: A crowdsourcing tool for lexicographic tasks

no code implementations LREC 2014 Darja Fišer, Aleš Tavčar, Tomaž Erjavec

The paper presents sloWCrowd, a simple tool developed to facilitate crowdsourcing lexicographic tasks, such as error correction in automatically generated wordnets and semantic annotation of corpora.

TweetCaT: a tool for building Twitter corpora of smaller languages

1 code implementation LREC 2014 Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages.

Language Identification

The goo300k corpus of historical Slovene

no code implementations LREC 2012 Tomaž Erjavec

The paper presents a gold-standard reference corpus of historical Slovene containing 1,000 sampled pages from over 80 texts, which were, for the most part, written between 1750 and 1900.

Lemmatization Optical Character Recognition
