Interoperability in an Infrastructure Enabling Multidisciplinary Research: The case of CLARIN

no code implementations LREC 2020 Franciska de Jong, Bente Maegaard, Darja Fišer, Dieter van Uytvanck, Andreas Witt

CLARIN is a European Research Infrastructure providing access to language resources and technologies for researchers in the humanities and social sciences.

Datasets of Slovene and Croatian Moderated News Comments

no code implementations WS 2018 Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

Both datasets are published in encrypted form so that others can run experiments on detecting content to be deleted without the potentially inappropriate content being revealed.

General Classification

Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings

1 code implementation WS 2018 Nikola Ljubešić, Darja Fišer, Anita Peti-Stantić

We show that the notions of concreteness and imageability are highly predictable both within and across languages, with a moderate loss of up to 20% in correlation when predicting across languages.

Cross-Lingual Transfer Word Embeddings

Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene

no code implementations WS 2017 Darja Fišer, Tomaž Erjavec, Nikola Ljubešić

In this paper we present the legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia.

General Classification

Language-independent Gender Prediction on Twitter

no code implementations WS 2017 Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

In this paper we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users' tweets.

Gender Prediction General Classification

Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text

no code implementations WS 2017 Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

We remove more than half of the error of the standard tagger when applied to non-standard texts by training it on a combination of standard and non-standard training data, while enriching the data representation with external resources removes an additional 11 percent of the error.

Domain Adaptation Lemmatization +2

Private or Corporate? Predicting User Types on Twitter

no code implementations WS 2016 Nikola Ljubešić, Darja Fišer

In this paper we present a series of experiments on discriminating between private and corporate accounts on Twitter.

sloWCrowd: A crowdsourcing tool for lexicographic tasks

no code implementations LREC 2014 Darja Fišer, Aleš Tavčar, Tomaž Erjavec

The paper presents sloWCrowd, a simple tool developed to facilitate crowdsourcing lexicographic tasks, such as error correction in automatically generated wordnets and semantic annotation of corpora.

TweetCaT: a tool for building Twitter corpora of smaller languages

1 code implementation LREC 2014 Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages.

Language Identification

Cleaning noisy wordnets

no code implementations LREC 2012 Benoît Sagot, Darja Fišer

Manual evaluation of the results shows that by applying a threshold similar to the estimated error rate in the respective wordnets, 67% of the proposed outlier candidates are indeed incorrect for French and 64% for Slovene.

Semantic Textual Similarity Word Sense Disambiguation

