Search Results for author: Nikola Ljubešić

Found 45 papers, 3 papers with code

TweetCaT: a tool for building Twitter corpora of smaller languages

1 code implementation LREC 2014 Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages.

Language Identification

The SETimes.HR Linguistically Annotated Corpus of Croatian

no code implementations LREC 2014 Željko Agić, Nikola Ljubešić

We build and evaluate statistical models for lemmatization, morphosyntactic tagging, named entity recognition and dependency parsing on top of SETimes.HR and the test sets, providing the state of the art in all the tasks.

Boundary Detection Dependency Parsing +5

Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene

1 code implementation LREC 2016 Nikola Ljubešić, Tomaž Erjavec

In this paper we present a tagger developed for inflectionally rich languages for which both a training corpus and a lexicon are available.

Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair

no code implementations LREC 2016 Nikola Ljubešić, Miquel Esplà-Gomis, Antonio Toral, Sergio Ortiz Rojas, Filip Klubička

This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest.

TweetGeo - A Tool for Collecting, Processing and Analysing Geo-encoded Linguistic Data

no code implementations COLING 2016 Nikola Ljubešić, Tanja Samardžić, Curdin Derungs

In this paper we present a newly developed tool that enables researchers interested in spatial variation of language to define a geographic perimeter of interest, collect data from the Twitter streaming API published in that perimeter, filter the obtained data by language and country, define and extract variables of interest and analyse the extracted variables by one spatial statistic and two spatial visualisations.

Private or Corporate? Predicting User Types on Twitter

no code implementations WS 2016 Nikola Ljubešić, Darja Fišer

In this paper we present a series of experiments on discriminating between private and corporate accounts on Twitter.

Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages

no code implementations WS 2017 Tanja Samardžić, Mirjana Starović, Željko Agić, Nikola Ljubešić

The paper documents the procedure of building a new Universal Dependencies (UDv2) treebank for Serbian starting from an existing Croatian UDv1 treebank and taking into account the other Slavic UD annotation guidelines.

Findings of the VarDial Evaluation Campaign 2017

no code implementations WS 2017 Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, Noëmi Aepli

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL'2017.

Dependency Parsing Dialect Identification

Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text

no code implementations WS 2017 Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

We remove more than half of the error of the standard tagger when applied to non-standard texts by training it on a combination of standard and non-standard training data, while enriching the data representation with external resources removes an additional 11 percent of the error.

Domain Adaptation Lemmatization +2

Language-independent Gender Prediction on Twitter

no code implementations WS 2017 Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

In this paper we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users' tweets.

Gender Prediction General Classification

Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene

no code implementations WS 2017 Darja Fišer, Tomaž Erjavec, Nikola Ljubešić

In this paper we present the legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia.

General Classification

Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings

1 code implementation WS 2018 Nikola Ljubešić, Darja Fišer, Anita Peti-Stantić

We show that the notions of concreteness and imageability are highly predictable both within and across languages, with a moderate loss of up to 20% in correlation when predicting across languages.

Cross-Lingual Transfer Representation Learning +1

Comparing CRF and LSTM performance on the task of morphosyntactic tagging of non-standard varieties of South Slavic languages

no code implementations COLING 2018 Nikola Ljubešić

This paper presents two systems taking part in the Morphosyntactic Tagging of Tweets shared task on Slovene, Croatian and Serbian data, organized inside the VarDial Evaluation Campaign.

Datasets of Slovene and Croatian Moderated News Comments

no code implementations WS 2018 Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

Both datasets are published in encrypted form, to enable others to perform experiments on detecting content to be deleted without revealing potentially inappropriate content.

General Classification

What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian

no code implementations WS 2019 Nikola Ljubešić, Kaja Dobrovoljc

We present experiments on Slovenian, Croatian and Serbian morphosyntactic annotation and lemmatisation, comparing the former state of the art for these three languages with one of the best performing systems at the CoNLL 2018 shared task, the Stanford NLP neural pipeline.

Word Embeddings

SemEval-2020 Task 3: Graded Word Similarity in Context

no code implementations SEMEVAL 2020 Carlos Santos Armendariz, Matthew Purver, Senja Pollak, Nikola Ljubešić, Matej Ulčar, Ivan Vulić, Mohammad Taher Pilehvar

This paper presents the Graded Word Similarity in Context (GWSC) task which asked participants to predict the effects of context on human perception of similarity in English, Croatian, Slovene and Finnish.

Translation Word Similarity
