A Report on the Third VarDial Evaluation Campaign

no code implementations WS 2019 Marcos Zampieri, Shervin Malmasi, Yves Scherrer, Tanja Samardžić, Francis Tyers, Miikka Silfverberg, Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei M. Butnaru, Tommi Jauhiainen

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019.

Dialect Identification Morphological Analysis

Encoder-Decoder Methods for Text Normalization

1 code implementation COLING 2018 Massimo Lusetti, Tatyana Ruzsics, Anne Göhring, Tanja Samardžić, Elisabeth Stark

Text normalization has been addressed with a variety of methods, most successfully with character-level statistical machine translation (CSMT).

Decoder Machine Translation

Neural Sequence-to-sequence Learning of Internal Word Structure

no code implementations CONLL 2017 Tatyana Ruzsics, Tanja Samardžić

Learning internal word structure has recently been recognized as an important step in various multilingual processing tasks and in theoretical language comparison.

Decoder Language Modelling

Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages

no code implementations WS 2017 Tanja Samardžić, Mirjana Starović, Željko Agić, Nikola Ljubešić

The paper documents the procedure of building a new Universal Dependencies (UDv2) treebank for Serbian starting from an existing Croatian UDv1 treebank and taking into account the other Slavic UD annotation guidelines.

TweetGeo - A Tool for Collecting, Processing and Analysing Geo-encoded Linguistic Data

no code implementations COLING 2016 Nikola Ljubešić, Tanja Samardžić, Curdin Derungs

In this paper we present a newly developed tool that enables researchers interested in spatial variation of language to define a geographic perimeter of interest, collect data published in that perimeter from the Twitter streaming API, filter the obtained data by language and country, define and extract variables of interest, and analyse the extracted variables with one spatial statistic and two spatial visualisations.

A Framework for Automatic Acquisition of Croatian and Serbian Verb Aspect from Corpora

no code implementations LREC 2016 Tanja Samardžić, Maja Miličević

Focusing on Croatian and Serbian, in this paper we propose a novel framework for automatic classification of their verb types into a number of fine-grained aspectual classes based on the observable morphology of verb forms.

General Classification

ArchiMob - A Corpus of Spoken Swiss German

no code implementations LREC 2016 Tanja Samardžić, Yves Scherrer, Elvira Glaser

Swiss dialects of German are, unlike most dialects of well standardised languages, widely used in everyday communication.

Machine Translation Part-Of-Speech Tagging

