Search Results for author: Krister Lindén

Found 16 papers, 3 papers with code

Italian Language and Dialect Identification and Regional French Variety Detection using Adaptive Naive Bayes

1 code implementation VarDial (COLING) 2022 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This article describes the language identification approach used by the SUKI team in the Identification of Languages and Dialects of Italy and the French Cross-Domain Dialect Identification shared tasks organized as part of the VarDial workshop 2022.

Dialect Identification, Position

Naive Bayes-based Experiments in Romanian Dialect Identification

no code implementations EACL (VarDial) 2021 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This article describes the experiments and systems developed by the SUKI team for the second edition of the Romanian Dialect Identification (RDI) shared task which was organized as part of the 2021 VarDial Evaluation Campaign.

Dialect Identification

HeLI-OTS, Off-the-shelf Language Identifier for Text

no code implementations LREC 2022 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This paper introduces HeLI-OTS, an off-the-shelf text language identification tool using the HeLI language identification method.

Language Identification

A Report on the VarDial Evaluation Campaign 2020

no code implementations VarDial (COLING) 2020 Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.

Dialect Identification

Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora

no code implementations VarDial (COLING) 2020 Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task.

Language Identification

Experiments in Language Variety Geolocation and Dialect Identification

no code implementations VarDial (COLING) 2020 Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

In this paper we describe the systems we used when participating in the VarDial Evaluation Campaign organized as part of the 7th workshop on NLP for similar languages, varieties and dialects.

Dialect Identification

Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks

no code implementations 24 Mar 2022 Anssi Moisio, Dejan Porjazovski, Aku Rouhe, Yaroslav Getman, Anja Virkkunen, Tamás Grósz, Krister Lindén, Mikko Kurimo

The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech into the Lahjoita puhetta (Donate Speech) corpus.

Automatic Speech Recognition (ASR)

FinnSentiment -- A Finnish Social Media Corpus for Sentiment Polarity Annotation

no code implementations 4 Dec 2020 Krister Lindén, Tommi Jauhiainen, Sam Hardwick

Sentiment analysis and opinion mining is an important task with obvious application areas in social media, e.g. when indicating hate speech and fake news.

Opinion Mining, Sentence

Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpus

no code implementations 27 Aug 2020 Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

This article introduces the Wanca 2017 corpus of texts crawled from the internet from which the sentences in rare Uralic languages for the use of the Uralic Language Identification (ULI) 2020 shared task were collected.

Language Identification

Language Model Adaptation for Language and Dialect Identification of Text

no code implementations 26 Mar 2019 Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen

This article describes an unsupervised language model adaptation approach that can be used to enhance the performance of language identification methods.

Dialect Identification, Language Modelling

Language and Dialect Identification of Cuneiform Texts

no code implementations WS 2019 Tommi Jauhiainen, Heidi Jauhiainen, Tero Alstola, Krister Lindén

This article introduces a corpus of cuneiform texts from which the dataset for the use of the Cuneiform Language Identification (CLI) 2019 shared task was derived as well as some preliminary language identification experiments conducted using that corpus.

Dialect Identification

Automatic Language Identification in Texts: A Survey

1 code implementation 22 Apr 2018 Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén

Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in.

Language Identification
