Search Results for author: Tommi Jauhiainen

Found 25 papers, 4 papers with code

HeLI-OTS, Off-the-shelf Language Identifier for Text

no code implementations LREC 2022 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This paper introduces HeLI-OTS, an off-the-shelf text language identification tool using the HeLI language identification method.

Language Identification

Experiments in Language Variety Geolocation and Dialect Identification

no code implementations VarDial (COLING) 2020 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

In this paper we describe the systems we used when participating in the VarDial Evaluation Campaign organized as part of the 7th workshop on NLP for similar languages, varieties and dialects.

Dialect Identification

Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora

no code implementations VarDial (COLING) 2020 Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task.

Language Identification

A Report on the VarDial Evaluation Campaign 2020

no code implementations VarDial (COLING) 2020 Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.

Dialect Identification

Italian Language and Dialect Identification and Regional French Variety Detection using Adaptive Naive Bayes

1 code implementation VarDial (COLING) 2022 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This article describes the language identification approach used by the SUKI team in the Identification of Languages and Dialects of Italy and the French Cross-Domain Dialect Identification shared tasks organized as part of the VarDial workshop 2022.

Dialect Identification Position

Naive Bayes-based Experiments in Romanian Dialect Identification

no code implementations EACL (VarDial) 2021 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This article describes the experiments and systems developed by the SUKI team for the second edition of the Romanian Dialect Identification (RDI) shared task which was organized as part of the 2021 VarDial Evaluation Campaign.

Dialect Identification

Language Variety Identification with True Labels

1 code implementation 2 Mar 2023 Marcos Zampieri, Kai North, Tommi Jauhiainen, Mariano Felice, Neha Kumari, Nishant Nair, Yash Bangera

Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety.

Language Identification

Comparing Approaches to Dravidian Language Identification

no code implementations EACL (VarDial) 2021 Tommi Jauhiainen, Tharindu Ranasinghe, Marcos Zampieri

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop.

Dialect Identification text-classification +1

FinnSentiment -- A Finnish Social Media Corpus for Sentiment Polarity Annotation

no code implementations 4 Dec 2020 Krister Lindén, Tommi Jauhiainen, Sam Hardwick

Sentiment analysis and opinion mining is an important task with obvious application areas in social media, e.g., when indicating hate speech and fake news.

Opinion Mining Sentence +1

Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpus

no code implementations 27 Aug 2020 Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

This article introduces the Wanca 2017 corpus of texts crawled from the internet from which the sentences in rare Uralic languages for the use of the Uralic Language Identification (ULI) 2020 shared task were collected.

Language Identification

Building Web Corpora for Minority Languages

no code implementations LREC 2020 Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén

Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter.

Language Identification Sentence

Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models

no code implementations WS 2019 Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen

This paper describes the language identification systems used by the SUKI team in the Discriminating between the Mainland and Taiwan variation of Mandarin Chinese (DMT) and the German Dialect Identification (GDI) shared tasks which were held as part of the third VarDial Evaluation Campaign.

Dialect Identification Language Modelling

A Report on the Third VarDial Evaluation Campaign

no code implementations WS 2019 Marcos Zampieri, Shervin Malmasi, Yves Scherrer, Tanja Samardžić, Francis Tyers, Miikka Silfverberg, Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei M. Butnaru, Tommi Jauhiainen

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019.

Dialect Identification Morphological Analysis

Language Model Adaptation for Language and Dialect Identification of Text

no code implementations 26 Mar 2019 Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen

This article describes an unsupervised language model adaptation approach that can be used to enhance the performance of language identification methods.

Dialect Identification Language Modelling

Language and Dialect Identification of Cuneiform Texts

no code implementations WS 2019 Tommi Jauhiainen, Heidi Jauhiainen, Tero Alstola, Krister Lindén

This article introduces a corpus of cuneiform texts from which the dataset for the use of the Cuneiform Language Identification (CLI) 2019 shared task was derived as well as some preliminary language identification experiments conducted using that corpus.

Dialect Identification

HeLI-based Experiments in Swiss German Dialect Identification

no code implementations COLING 2018 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

In this paper we present the experiments and results by the SUKI team in the German Dialect Identification shared task of the VarDial 2018 Evaluation Campaign.

Dialect Identification

Iterative Language Model Adaptation for Indo-Aryan Language Identification

no code implementations COLING 2018 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This paper presents the experiments and results obtained by the SUKI team in the Indo-Aryan Language Identification shared task of the VarDial 2018 Evaluation Campaign.

Language Identification Language Modelling

HeLI-based Experiments in Discriminating Between Dutch and Flemish Subtitles

no code implementations COLING 2018 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This paper presents the experiments and results obtained by the SUKI team in the Discriminating between Dutch and Flemish in Subtitles shared task of the VarDial 2018 Evaluation Campaign.

Clustering Language Identification +1

Automatic Language Identification in Texts: A Survey

1 code implementation 22 Apr 2018 Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén

Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in.

Language Identification

Evaluating HeLI with Non-Linear Mappings

no code implementations WS 2017 Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen

In this paper we describe the non-linear mappings we used with the Helsinki language identification method, HeLI, in the 4th edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2017 workshop.

Language Identification Position
