Search Results for author: Niko Partanen

Found 36 papers, 13 papers with code

Linguistic change and historical periodization of Old Literary Finnish

no code implementations • ACL (LChange) 2021 • Niko Partanen, Khalid Alnajjar, Mika Hämäläinen, Jack Rueter

In this study, we have normalized and lemmatized an Old Literary Finnish corpus using a lemmatization model trained on texts from Agricola.

Lemmatization Word Embeddings


Processing M.A. Castrén’s Materials: Multilingual Historical Typed and Handwritten Manuscripts

no code implementations • NLP4DH (ICON) 2021 • Niko Partanen, Jack Rueter, Khalid Alnajjar, Mika Hämäläinen

The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813–1852).


Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora

no code implementations • VarDial (COLING) 2020 • Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task.

Language Identification


A Report on the VarDial Evaluation Campaign 2020

no code implementations • VarDial (COLING) 2020 • Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.

Dialect Identification


Findings of the VarDial Evaluation Campaign 2021

no code implementations • EACL (VarDial) 2021 • Bharathi Raja Chakravarthi, Mihaela Gaman, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Ruba Priyadharshini, Christoph Purschke, Eswari Rajagopal, Yves Scherrer, Marcos Zampieri

This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021.

Dialect Identification


The Relevance of the Source Language in Transfer Learning for ASR

no code implementations • ComputEL 2021 • Nils Hjortnaes, Niko Partanen, Michael Rießler, Francis M. Tyers

Transfer Learning


Semiautomatic Speech Alignment for Under-Resourced Languages

no code implementations • EURALI (LREC) 2022 • Juho Leinonen, Niko Partanen, Sami Virpioja, Mikko Kurimo

Cross-language forced alignment is a solution for linguists who create speech corpora for very low-resource languages.


Keyword spotting for audiovisual archival search in Uralic languages

no code implementations • ACL (IWCLUL) 2021 • Nils Hjortnaes, Niko Partanen, Francis M. Tyers

Keyword Spotting


Overview of Open-Source Morphology Development for the Komi-Zyrian Language: Past and future

1 code implementation • ACL (IWCLUL) 2021 • Jack Rueter, Niko Partanen, Mika Hämäläinen, Trond Trosterud


Numerals and what counts

no code implementations • UDW (SyntaxFest) 2021 • Jack Rueter, Niko Partanen, Flammie A. Pirinen


Processing M.A. Castrén's Materials: Multilingual Typed and Handwritten Manuscripts

no code implementations • 28 Dec 2021 • Niko Partanen, Jack Rueter, Mika Hämäläinen, Khalid Alnajjar

The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813–1852).


Detecting Depression in Thai Blog Posts: a Dataset and a Baseline

no code implementations • WNUT (ACL) 2021 • Mika Hämäläinen, Pattama Patpong, Khalid Alnajjar, Niko Partanen, Jack Rueter

We present the first openly available corpus for detecting depression in Thai.


Finnish Dialect Identification: The Effect of Audio and Text

1 code implementation • EMNLP 2021 • Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, Jack Rueter

Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice.

Dialect Identification


How Cute is Pikachu? Gathering and Ranking Pokémon Properties from Data with Pokémon Word Embeddings

no code implementations • 21 Aug 2021 • Mika Hämäläinen, Khalid Alnajjar, Niko Partanen

Based on our experiments, it is better to train a model with domain specific data than to use a pretrained model.

Descriptive Word Embeddings


Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

1 code implementation • JEP/TALN/RECITAL 2021 • Mika Hämäläinen, Niko Partanen, Khalid Alnajjar

Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century.

Lemmatization


Apurinã Universal Dependencies Treebank

no code implementations • NAACL (AmericasNLP) 2021 • Jack Rueter, Marília Fernanda Pereira de Freitas, Sidney da Silva Facundes, Mika Hämäläinen, Niko Partanen

The construction of the treebank has also served as an opportunity to develop finite-state description of the language and facilitate the transfer of open-source infrastructure possibilities to an endangered language of the Amazon.


Never guess what I heard... Rumor Detection in Finnish News: a Dataset and a Baseline

no code implementations • NAACL (NLP4IF) 2021 • Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, Jack Rueter

However, a model fine-tuned on Multilingual BERT reaches the best factual label accuracy of 97.2%.


Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered

1 code implementation • NoDaLiDa 2021 • Mika Hämäläinen, Niko Partanen, Jack Rueter, Khalid Alnajjar

We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages.

Lemmatization Morphological Analysis


Normalization of Different Swedish Dialects Spoken in Finland

1 code implementation • 9 Dec 2020 • Mika Hämäläinen, Niko Partanen, Khalid Alnajjar

Our study presents a dialect normalization method for different Finland Swedish dialects covering six regions.


Speech Recognition for Endangered and Extinct Samoyedic languages

no code implementations • PACLIC 2020 • Niko Partanen, Mika Hämäläinen, Tiina Klooster

Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia.

Speech Recognition


Ve'rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement

1 code implementation • COLING 2020 • Khalid Alnajjar, Mika Hämäläinen, Jack Rueter, Niko Partanen

We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors.


Open-Source Morphology for Endangered Mordvinic Languages

2 code implementations • 11 Nov 2020 • Jack Rueter, Mika Hämäläinen, Niko Partanen

This document describes shared development of finite-state description of two closely related but endangered minority languages, Erzya and Moksha.

Unity


Automated Prediction of Medieval Arabic Diacritics

1 code implementation • 11 Oct 2020 • Khalid Alnajjar, Mika Hämäläinen, Niko Partanen, Jack Rueter

This study uses a character level neural machine translation approach trained on a long short-term memory-based bi-directional recurrent neural network architecture for diacritization of Medieval Arabic.

Machine Translation Translation


On the questions in developing computational infrastructure for Komi-Permyak

1 code implementation • WS 2020 • Jack Rueter, Niko Partanen, Larisa Ponomareva


Towards a Speech Recognizer for Komi, an Endangered and Low-Resource Uralic Language

no code implementations • WS 2020 • Nils Hjortnaes, Niko Partanen, Michael Rießler, Francis M. Tyers


Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity

1 code implementation • 6 Sep 2020 • Mika Hämäläinen, Niko Partanen, Khalid Alnajjar, Jack Rueter, Thierry Poibeau

The models are tested with over 20 different dialects.

NMT Transfer Learning


Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpus

no code implementations • 27 Aug 2020 • Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

This article introduces the Wanca 2017 corpus of texts crawled from the internet from which the sentences in rare Uralic languages for the use of the Uralic Language Identification (ULI) 2020 shared task were collected.

Language Identification


Improving the Language Model for Low-Resource ASR with Online Text Corpora

no code implementations • LREC 2020 • Nils Hjortnaes, Timofey Arkhangelskiy, Niko Partanen, Michael Rießler, Francis Tyers

Previous experiments showed that transfer learning using DeepSpeech can improve the accuracy of a speech recognizer for Komi, though the error rate remained very high.

Automatic Speech Recognition (ASR)


Dialect Text Normalization to Normative Standard Finnish

1 code implementation • WS 2019 • Niko Partanen, Mika Hämäläinen, Khalid Alnajjar

We compare different LSTMs and transformer models in terms of their effectiveness in normalizing dialectal Finnish into the normative standard Finnish.