Search Results for author: Tommi Jauhiainen

Found 25 papers, 4 papers with code

HeLI-OTS, Off-the-shelf Language Identifier for Text

no code implementations • LREC 2022 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This paper introduces HeLI-OTS, an off-the-shelf text language identification tool using the HeLI language identification method.

Language Identification


Experiments in Language Variety Geolocation and Dialect Identification

no code implementations • VarDial (COLING) 2020 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

In this paper we describe the systems we used when participating in the VarDial Evaluation Campaign organized as part of the 7th workshop on NLP for similar languages, varieties and dialects.

Dialect Identification


Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora

no code implementations • VarDial (COLING) 2020 • Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task.

Language Identification


A Report on the VarDial Evaluation Campaign 2020

no code implementations • VarDial (COLING) 2020 • Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.

Dialect Identification


Findings of the VarDial Evaluation Campaign 2021

no code implementations • EACL (VarDial) 2021 • Bharathi Raja Chakravarthi, Mihaela Gaman, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Ruba Priyadharshini, Christoph Purschke, Eswari Rajagopal, Yves Scherrer, Marcos Zampieri

This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021.

Dialect Identification


Italian Language and Dialect Identification and Regional French Variety Detection using Adaptive Naive Bayes

1 code implementation • VarDial (COLING) 2022 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This article describes the language identification approach used by the SUKI team in the Identification of Languages and Dialects of Italy and the French Cross-Domain Dialect Identification shared tasks organized as part of the VarDial workshop 2022.

Dialect Identification Position


Naive Bayes-based Experiments in Romanian Dialect Identification

no code implementations • EACL (VarDial) 2021 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This article describes the experiments and systems developed by the SUKI team for the second edition of the Romanian Dialect Identification (RDI) shared task which was organized as part of the 2021 VarDial Evaluation Campaign.

Dialect Identification


Findings of the VarDial Evaluation Campaign 2023

no code implementations • 31 May 2023 • Noëmi Aepli, Çağrı Çöltekin, Rob van der Goot, Tommi Jauhiainen, Mourhaf Kazzaz, Nikola Ljubešić, Kai North, Barbara Plank, Yves Scherrer, Marcos Zampieri

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023.

Intent Detection


Language Variety Identification with True Labels

1 code implementation • 2 Mar 2023 • Marcos Zampieri, Kai North, Tommi Jauhiainen, Mariano Felice, Neha Kumari, Nishant Nair, Yash Bangera

Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety.

Language Identification


Comparing Approaches to Dravidian Language Identification

no code implementations • EACL (VarDial) 2021 • Tommi Jauhiainen, Tharindu Ranasinghe, Marcos Zampieri

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop.

Dialect Identification text-classification


FinnSentiment -- A Finnish Social Media Corpus for Sentiment Polarity Annotation

no code implementations • 4 Dec 2020 • Krister Lindén, Tommi Jauhiainen, Sam Hardwick

Sentiment analysis and opinion mining is an important task with obvious application areas in social media, e.g., when indicating hate speech and fake news.

Opinion Mining Sentence


Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpus

no code implementations • 27 Aug 2020 • Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

This article introduces the Wanca 2017 corpus of texts crawled from the internet from which the sentences in rare Uralic languages for the use of the Uralic Language Identification (ULI) 2020 shared task were collected.

Language Identification


Building Web Corpora for Minority Languages

no code implementations • LREC 2020 • Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén

Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter.

Language Identification Sentence


Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models

no code implementations • WS 2019 • Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen

This paper describes the language identification systems used by the SUKI team in the Discriminating between the Mainland and Taiwan variation of Mandarin Chinese (DMT) and the German Dialect Identification (GDI) shared tasks which were held as part of the third VarDial Evaluation Campaign.

Dialect Identification Language Modelling


A Report on the Third VarDial Evaluation Campaign

no code implementations • WS 2019 • Marcos Zampieri, Shervin Malmasi, Yves Scherrer, Tanja Samardžić, Francis Tyers, Miikka Silfverberg, Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei M. Butnaru, Tommi Jauhiainen

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019.

Dialect Identification Morphological Analysis


Language Model Adaptation for Language and Dialect Identification of Text

no code implementations • 26 Mar 2019 • Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen

This article describes an unsupervised language model adaptation approach that can be used to enhance the performance of language identification methods.

Dialect Identification Language Modelling


Language and Dialect Identification of Cuneiform Texts

no code implementations • WS 2019 • Tommi Jauhiainen, Heidi Jauhiainen, Tero Alstola, Krister Lindén

This article introduces a corpus of cuneiform texts from which the dataset for the use of the Cuneiform Language Identification (CLI) 2019 shared task was derived as well as some preliminary language identification experiments conducted using that corpus.

Dialect Identification


HeLI-based Experiments in Swiss German Dialect Identification

no code implementations • COLING 2018 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

In this paper we present the experiments and results by the SUKI team in the German Dialect Identification shared task of the VarDial 2018 Evaluation Campaign.

Dialect Identification


Iterative Language Model Adaptation for Indo-Aryan Language Identification

no code implementations • COLING 2018 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This paper presents the experiments and results obtained by the SUKI team in the Indo-Aryan Language Identification shared task of the VarDial 2018 Evaluation Campaign.

Language Identification Language Modelling


HeLI-based Experiments in Discriminating Between Dutch and Flemish Subtitles

no code implementations • COLING 2018 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

This paper presents the experiments and results obtained by the SUKI team in the Discriminating between Dutch and Flemish in Subtitles shared task of the VarDial 2018 Evaluation Campaign.

Clustering Language Identification


Automatic Language Identification in Texts: A Survey

1 code implementation • 22 Apr 2018 • Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén

Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in.

Language Identification


Evaluation of language identification methods using 285 languages

no code implementations • WS 2017 • Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen

Language Identification


Evaluating HeLI with Non-Linear Mappings

no code implementations • WS 2017 • Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen

In this paper we describe the non-linear mappings we used with the Helsinki language identification method, HeLI, in the 4th edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2017 workshop.

Language Identification Position


HeLI, a Word-Based Backoff Method for Language Identification

1 code implementation • WS 2016 • Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen

The shared task comprised a total of 8 tracks, of which we participated in 7.

Language Identification Position


Discriminating Similar Languages with Token-Based Backoff

no code implementations • WS 2015 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

Language Identification

