Search Results for author: Nikola Ljubešić

We build and evaluate statistical models for lemmatization, morphosyntactic tagging, named entity recognition and dependency parsing on top of SETimes.HR and the test sets, providing the state of the art in all the tasks.

Boundary Detection Dependency Parsing +5

Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites

no code implementations • LREC 2014 • Miquel Esplà-Gomis, Filip Klubička, Nikola Ljubešić, Sergio Ortiz-Rojas, Vassilis Papavassiliou, Prokopis Prokopidis

We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English-Croatian parallel corpus.

Information Retrieval Machine Translation +1

Quality Estimation for Synthetic Parallel Data Generation

no code implementations • LREC 2014 • Raphael Rubino, Antonio Toral, Nikola Ljubešić, Gema Ramírez-Sánchez

This paper presents a novel approach for parallel data generation using machine translation and quality estimation.

Machine Translation Sentence +1

caWaC - A web corpus of Catalan and its application to language modeling and machine translation

no code implementations • LREC 2014 • Nikola Ljubešić, Antonio Toral

In this paper we present the construction process of a web corpus of Catalan built from the content of the .cat top-level domain.

Language Modelling Machine Translation +1

A Report on the DSL Shared Task 2014

no code implementations • WS 2014 • Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann

Language Identification

Exploring cross-language statistical machine translation for closely related South Slavic languages

no code implementations • WS 2014 • Maja Popović, Nikola Ljubešić

Machine Translation Translation

Abu-MaTran: Automatic building of Machine Translation

no code implementations • EAMT 2016 • Antonio Toral, Tommi A. Pirinen, Andy Way, Gema Ramírez-Sánchez, Sergio Ortiz Rojas, Raphael Rubino, Miquel Esplà, Mikel L. Forcada, Vassilis Papavassiliou, Prokopis Prokopidis, Nikola Ljubešić

Machine Translation Transfer Learning +1

Predicting Inflectional Paradigms and Lemmata of Unknown Words for Semi-automatic Expansion of Morphological Lexicons

no code implementations • RANLP 2015 • Nikola Ljubešić, Miquel Esplà-Gomis, Filip Klubička, Nives Mikelić Preradović

Universal Dependencies for Croatian (that work for Serbian, too)

no code implementations • WS 2015 • Željko Agić, Nikola Ljubešić

Cross-Lingual Transfer Dependency Parsing

Overview of the DSL Shared Task 2015

no code implementations • WS 2015 • Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Preslav Nakov

Language Identification

Regional Linguistic Data Initiative (ReLDI)

no code implementations • WS 2015 • Tanja Samardžić, Nikola Ljubešić, Maja Miličević

Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling

no code implementations • WS 2015 • Raphael Rubino, Tommi Pirinen, Miquel Esplà-Gomis, Nikola Ljubešić, Sergio Ortiz-Rojas, Vassilis Papavassiliou, Prokopis Prokopidis, Antonio Toral

Machine Translation Translation

Predicting the Level of Text Standardness in User-generated Content

no code implementations • RANLP 2015 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec, Jaka Čibej, Dafne Marko, Senja Pollak, Iza Škrjanec

Dealing with Data Sparseness in SMT with Factored Models and Morphological Expansion: a Case Study on Croatian

no code implementations • WS 2016 • Victor M. Sánchez-Cartagena, Nikola Ljubešić, Filip Klubička

Machine Translation

Collaborative Development of a Rule-Based Machine Translator between Croatian and Serbian

no code implementations • WS 2016 • Filip Klubička, Gema Ramírez-Sánchez, Nikola Ljubešić

Machine Translation

Croatian Error-Annotated Corpus of Non-Professional Written Language

no code implementations • LREC 2016 • Vanja Štefanec, Nikola Ljubešić, Jelena Kuvač Kraljević

In this paper, the authors present the Croatian corpus of non-professional written language.

Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene

1 code implementation • LREC 2016 • Nikola Ljubešić, Tomaž Erjavec

In this paper we present a tagger developed for inflectionally rich languages for which both a training corpus and a lexicon are available.

Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair

no code implementations • LREC 2016 • Nikola Ljubešić, Miquel Esplà-Gomis, Antonio Toral, Sergio Ortiz Rojas, Filip Klubička

This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest.

New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian

no code implementations • LREC 2016 • Nikola Ljubešić, Filip Klubička, Željko Agić, Ivo-Pavao Jazbec

In this paper we present newly developed inflectional lexicons and manually annotated corpora of Croatian and Serbian.

LEMMA

Corpus-Based Diacritic Restoration for South Slavic Languages

no code implementations • LREC 2016 • Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

In computer-mediated communication, users of Latin-based scripts often omit diacritics when writing.

A Global Analysis of Emoji Usage

no code implementations • WS 2016 • Nikola Ljubešić, Darja Fišer

TweetGeo - A Tool for Collecting, Processing and Analysing Geo-encoded Linguistic Data

no code implementations • COLING 2016 • Nikola Ljubešić, Tanja Samardžić, Curdin Derungs

In this paper we present a newly developed tool that enables researchers interested in spatial variation of language to define a geographic perimeter of interest, collect data from the Twitter streaming API published in that perimeter, filter the obtained data by language and country, define and extract variables of interest and analyse the extracted variables by one spatial statistic and two spatial visualisations.

Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian

no code implementations • WS 2016 • Maja Popović, Kostadin Cholakov, Valia Kordoni, Nikola Ljubešić

Massive Open Online Courses have been growing rapidly in size and impact.

Machine Translation Translation

Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task

no code implementations • WS 2016 • Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann

We present the results of the third edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial'2016 workshop at COLING'2016.

Dialect Identification General Classification +1

Private or Corporate? Predicting User Types on Twitter

no code implementations • WS 2016 • Nikola Ljubešić, Darja Fišer

In this paper we present a series of experiments on discriminating between private and corporate accounts on Twitter.

Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages

no code implementations • WS 2017 • Tanja Samardžić, Mirjana Starović, Željko Agić, Nikola Ljubešić

The paper documents the procedure of building a new Universal Dependencies (UDv2) treebank for Serbian starting from an existing Croatian UDv1 treebank and taking into account the other Slavic UD annotation guidelines.

Findings of the VarDial Evaluation Campaign 2017

no code implementations • WS 2017 • Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, Noëmi Aepli

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL'2017.

Dependency Parsing Dialect Identification

Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text

no code implementations • WS 2017 • Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

We remove more than half of the error of the standard tagger when applied to non-standard texts by training it on a combination of standard and non-standard training data, while enriching the data representation with external resources removes an additional 11 percent of the error.

Domain Adaptation Lemmatization +2

Language-independent Gender Prediction on Twitter

no code implementations • WS 2017 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

In this paper we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users' tweets.

Gender Prediction General Classification

Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene

no code implementations • WS 2017 • Darja Fišer, Tomaž Erjavec, Nikola Ljubešić

In this paper we present the legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia.

General Classification

Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings

1 code implementation • WS 2018 • Nikola Ljubešić, Darja Fišer, Anita Peti-Stantić

We show that the notions of concreteness and imageability are highly predictable both within and across languages, with a moderate loss of up to 20% in correlation when predicting across languages.

Cross-Lingual Transfer Representation Learning +1

Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign

no code implementations • COLING 2018 • Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Ahmed Ali, Suwon Shon, James Glass, Yves Scherrer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiedemann, Chris van der Lee, Stefan Grondelaers, Nelleke Oostdijk, Dirk Speelman, Antal Van den Bosch, Ritesh Kumar, Bornini Lahiri, Mayank Jain

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects.

Dependency Parsing Dialect Identification

Comparing CRF and LSTM performance on the task of morphosyntactic tagging of non-standard varieties of South Slavic languages

no code implementations • COLING 2018 • Nikola Ljubešić

This paper presents two systems taking part in the Morphosyntactic Tagging of Tweets shared task on Slovene, Croatian and Serbian data, organized inside the VarDial Evaluation Campaign.

Datasets of Slovene and Croatian Moderated News Comments

no code implementations • WS 2018 • Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

Both datasets are published in encrypted form, to enable others to perform experiments on detecting content to be deleted without revealing potentially inappropriate content.

General Classification

What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian

no code implementations • WS 2019 • Nikola Ljubešić, Kaja Dobrovoljc

We present experiments on Slovenian, Croatian and Serbian morphosyntactic annotation and lemmatisation that compare the former state of the art for these three languages with one of the best performing systems at the CoNLL 2018 shared task, the Stanford NLP neural pipeline.

Word Embeddings

Improving UD processing via satellite resources for morphology

no code implementations • WS 2019 • Kaja Dobrovoljc, Tomaž Erjavec, Nikola Ljubešić

Gigafida 2.0: The Reference Corpus of Written Standard Slovene

no code implementations • LREC 2020 • Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraz Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem, Kaja Dobrovoljc

We describe a new version of the Gigafida reference corpus of Slovene.

SemEval-2020 Task 3: Graded Word Similarity in Context

no code implementations • SEMEVAL 2020 • Carlos Santos Armendariz, Matthew Purver, Senja Pollak, Nikola Ljubešić, Matej Ulčar, Ivan Vulić, Mohammad Taher Pilehvar

This paper presents the Graded Word Similarity in Context (GWSC) task which asked participants to predict the effects of context on human perception of similarity in English, Croatian, Slovene and Finnish.

Translation Word Similarity
