Search Results for author: Tomaž Erjavec

Found 17 papers, 2 papers with code

Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene

1 code implementation • LREC 2016 • Nikola Ljubešić, Tomaž Erjavec

In this paper we present a tagger developed for inflectionally rich languages for which both a training corpus and a lexicon are available.

TweetCaT: a tool for building Twitter corpora of smaller languages

1 code implementation • LREC 2014 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages.

Language Identification

Datasets of Slovene and Croatian Moderated News Comments

no code implementations • WS 2018 • Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

Both datasets are published in encrypted form, to enable others to perform experiments on detecting content to be deleted without revealing potentially inappropriate content.

General Classification

The Universal Dependencies Treebank for Slovenian

no code implementations • WS 2017 • Kaja Dobrovoljc, Tomaž Erjavec, Simon Krek

We give an overview of the existing dependency treebanks for Slovenian and then detail the conversion of the ssj200k treebank to the framework of Universal Dependencies version 2.

Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text

no code implementations • WS 2017 • Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

We remove more than half of the error of the standard tagger when applied to non-standard texts by training it on a combination of standard and non-standard training data, while enriching the data representation with external resources removes an additional 11 percent of the error.

Domain Adaptation Lemmatization +2

Language-independent Gender Prediction on Twitter

no code implementations • WS 2017 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

In this paper we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users' tweets.

Gender Prediction General Classification

Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene

no code implementations • WS 2017 • Darja Fišer, Tomaž Erjavec, Nikola Ljubešić

In this paper we present the legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia.

General Classification

CLARIN's Key Resource Families

no code implementations • LREC 2018 • Darja Fišer, Jakob Lenardič, Tomaž Erjavec

Modernizing historical Slovene words with character-based SMT

no code implementations • WS 2013 • Yves Scherrer, Tomaž Erjavec

Lemmatization Transliteration

Lexicon Construction and Corpus Annotation of Historical Language with the CoBaLT Editor

no code implementations • WS 2012 • Tom Kenter, Tomaž Erjavec, Maja Žorga Dulmin, Darja Fišer

sloWCrowd: A crowdsourcing tool for lexicographic tasks

no code implementations • LREC 2014 • Darja Fišer, Aleš Tavčar, Tomaž Erjavec

The paper presents sloWCrowd, a simple tool developed to facilitate crowdsourcing lexicographic tasks, such as error correction in automatically generated wordnets and semantic annotation of corpora.

The goo300k corpus of historical Slovene

no code implementations • LREC 2012 • Tomaž Erjavec

The paper presents a gold-standard reference corpus of historical Slovene containing 1,000 sampled pages from over 80 texts, which were, for the most part, written between 1750 and 1900.

LEMMA Lemmatization +1

Predicting the Level of Text Standardness in User-generated Content

no code implementations • RANLP 2015 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec, Jaka Čibej, Dafne Marko, Senja Pollak, Iza Škrjanec

Corpus-Based Diacritic Restoration for South Slavic Languages

no code implementations • LREC 2016 • Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

In computer-mediated communication, users of Latin-based scripts often omit diacritics when writing.

Improving UD processing via satellite resources for morphology

no code implementations • WS 2019 • Kaja Dobrovoljc, Tomaž Erjavec, Nikola Ljubešić

The siParl corpus of Slovene parliamentary proceedings

no code implementations • LREC 2020 • Andrej Pancur, Tomaž Erjavec

The paper describes the process of acquisition, up-translation, encoding, annotation, and distribution of siParl, a collection of the parliamentary debates from the Assembly of the Republic of Slovenia from 1990–2018, covering the period from just before Slovenia became an independent country in 1991, and almost up to the present.

Translation

Gigafida 2.0: The Reference Corpus of Written Standard Slovene

no code implementations • LREC 2020 • Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraz Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem, Kaja Dobrovoljc

We describe a new version of the Gigafida reference corpus of Slovene.
