no code implementations • LChange (ACL) 2022 • Iiro Rastas, Yann Ciarán Ryan, Iiro Tiihonen, Mohammadreza Qaraei, Liina Repo, Rohit Babbar, Eetu Mäkelä, Mikko Tolonen, Filip Ginter
In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents.
Optical Character Recognition (OCR)
no code implementations • NoDaLiDa 2021 • Jouni Luoma, Li-Hsin Chang, Filip Ginter, Sampo Pyysalo
We introduce a corpus with fine-grained named entity annotation for Finnish, following the OntoNotes guidelines to create a resource that is cross-lingually compatible with existing annotations for other languages.
no code implementations • NAACL (BEA) 2022 • Li-Hsin Chang, Jenna Kanerva, Filip Ginter
Automatic grouping of textual answers has the potential of allowing batch grading, but is challenging because the answers, especially longer essays, have many claims.
1 code implementation • Findings (ACL) 2022 • Samuel Rönnqvist, Aki-Juhani Kyröläinen, Amanda Myntti, Filip Ginter, Veronika Laippala
Input saliency methods have recently become a popular tool for explaining predictions of deep learning models in NLP.
no code implementations • 19 Feb 2025 • Joonatan Laato, Jenna Kanerva, John Loehr, Virpi Lummaa, Filip Ginter
Second, we aim to evaluate several alternative ways of approaching this task, comparing a number of generative models and a supervised learning approach, to gain broader insight into the relative merits of these approaches and their applicability in similar studies.
1 code implementation • 13 Jan 2025 • Erik Henriksson, Otto Tarkka, Filip Ginter
To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets.
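As a rough illustration of this kind of comparison (not the paper's actual training setup; the corpus file names, model size, and hyperparameters below are placeholders), a GPT-2 model can be trained from scratch on each corpus variant with standard Hugging Face tooling:

```python
# Minimal sketch: train a randomly initialised GPT-2 on a plain-text corpus so
# that an "original" and a "filtered" run can be compared. File names and
# hyperparameters are hypothetical.
from datasets import load_dataset
from transformers import (AutoTokenizer, GPT2Config, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def train_lm(corpus_file, output_dir):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    dataset = load_dataset("text", data_files={"train": corpus_file})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])
    model = GPT2LMHeadModel(GPT2Config())            # randomly initialised GPT-2
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir=output_dir,
                                             per_device_train_batch_size=8,
                                             num_train_epochs=1),
                      train_dataset=dataset, data_collator=collator)
    trainer.train()

train_lm("original_corpus.txt", "gpt2-original")     # hypothetical file names
train_lm("filtered_corpus.txt", "gpt2-filtered")
```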
no code implementations • 10 Jan 2025 • Emil Nuutinen, Iiro Rastas, Filip Ginter
We apply a simple method to machine translate datasets with span-level annotation using the DeepL MT service and its ability to translate formatted documents.
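The underlying idea can be sketched as follows: span annotations are encoded as inline markup, the tagged document is sent through the MT service's tag-aware translation, and the spans are read back from the translated markup. The helper functions and the specific DeepL options below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the general idea: wrap each annotated span in an inline XML tag,
# translate the tagged document, then recover span offsets from the output tags.
import re
import deepl  # official DeepL Python client; requires an API key

def spans_to_xml(text, spans):
    """spans: list of (start, end, label) character offsets in `text`."""
    out, prev = [], 0
    for i, (start, end, label) in enumerate(sorted(spans)):
        out.append(text[prev:start])
        out.append(f'<span id="{i}" label="{label}">{text[start:end]}</span>')
        prev = end
    out.append(text[prev:])
    return "".join(out)

def xml_to_spans(tagged_text):
    spans, plain, pos = [], [], 0
    for m in re.finditer(r'<span id="\d+" label="([^"]+)">(.*?)</span>|([^<]+)',
                         tagged_text, flags=re.S):
        chunk = m.group(2) if m.group(2) is not None else m.group(3)
        if m.group(1) is not None:
            spans.append((pos, pos + len(chunk), m.group(1)))
        plain.append(chunk)
        pos += len(chunk)
    return "".join(plain), spans

translator = deepl.Translator("YOUR_AUTH_KEY")        # hypothetical key
tagged = spans_to_xml("Helsinki is the capital of Finland.", [(0, 8, "LOC")])
result = translator.translate_text(tagged, target_lang="FI", tag_handling="xml")
print(xml_to_spans(result.text))
```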
no code implementations • 3 Nov 2023 • Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Le Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, Sampo Pyysalo
We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
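A minimal sketch of the "continued pretraining on a data mix" idea, using a small publicly available BLOOM checkpoint as a stand-in and hypothetical data files (the actual BLUUMI run continues the 176B-parameter BLOOM model):

```python
# Sketch only: interleave the original and Finnish data, then continue causal-LM
# training of a BLOOM checkpoint. Data files are placeholders.
from datasets import interleave_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

original = load_dataset("text", data_files="bloom_original_sample.txt")["train"]
finnish = load_dataset("text", data_files="finnish_corpus.txt")["train"]
mixed = interleave_datasets([original, finnish], probabilities=[0.5, 0.5], seed=1)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # small stand-in
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
# ... tokenize `mixed` and continue training with the usual causal-LM objective.
```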
1 code implementation • 18 May 2023 • Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank
Most research in Relation Extraction (RE) involves the English language, mainly due to the lack of multi-lingual resources.
1 code implementation • 18 May 2023 • Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank
One of the main reasons for this is the limited training size of current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain.
no code implementations • 21 Nov 2022 • Muhammad Junaid Haris, Aanchal Upreti, Melih Kurtaran, Filip Ginter, Sebastien Lafond, Sepinoud Azimi
The problem of gender bias is highly prevalent and well known.
1 code implementation • 22 Jun 2022 • Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou, Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh Dhole, Khyathi Raghavi Chandu, Laura Perez-Beltrachini, Leonardo F. R. Ribeiro, Lewis Tunstall, Li Zhang, Mahima Pushkarna, Mathias Creutz, Michael White, Mihir Sanjay Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani, Ondrej Dusek, Paul Pu Liang, Pawan Sasanka Ammanamanchi, Qi Zhu, Ratish Puduppully, Reno Kriz, Rifat Shahriyar, Ronald Cardenas, Saad Mahamood, Salomey Osei, Samuel Cahyawijaya, Sanja Štajner, Sebastien Montella, Shailza, Shailza Jolly, Simon Mille, Tahmid Hasan, Tianhao Shen, Tosin Adewumi, Vikas Raunak, Vipul Raheja, Vitaly Nikolaev, Vivian Tsai, Yacine Jernite, Ying Xu, Yisi Sang, Yixin Liu, Yufang Hou
This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims.
1 code implementation • LREC 2022 • Jenna Kanerva, Filip Ginter
The prevailing practice in academia is to evaluate model performance on in-domain evaluation data typically set aside from the training corpus.
1 code implementation • 9 Dec 2021 • Jenna Kanerva, Hanna Kitti, Li-Hsin Chang, Teemu Vahtola, Mathias Creutz, Filip Ginter
In this paper, we approach the problem of semantic search by framing the search task as paraphrase span detection, i.e., given a segment of text as a query phrase, the task is to identify its paraphrase in a given document, the same modelling setup as is typically used in extractive question answering.
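To make the framing concrete, the following is a hedged sketch of the extractive-QA setup described above, using a generic off-the-shelf QA model (not the paper's model): the query phrase plays the role of the question and the document the role of the context.

```python
# Illustration of the extractive-QA framing: the predicted answer span is the
# candidate paraphrase of the query phrase. The model name is an example only.
from transformers import pipeline

qa = pipeline("question-answering",
              model="deepset/roberta-base-squad2")    # example model, assumption
query_phrase = "the weather turned bad very quickly"
document = ("We had planned a long hike, but conditions deteriorated rapidly "
            "in the afternoon, so we returned to the cabin early.")
pred = qa(question=query_phrase, context=document)
print(pred["answer"], pred["start"], pred["end"], pred["score"])
```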
no code implementations • 31 Aug 2021 • Samuel Rönnqvist, Amanda Myntti, Aki-Juhani Kyröläinen, Sampo Pyysalo, Veronika Laippala, Filip Ginter
In this work, we propose a method for explaining classes using deep learning models and the Integrated Gradients feature attribution technique by aggregating explanations of individual examples in text classification to general descriptions of the classes.
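A generic, self-contained sketch of the technique on a toy classifier (not the authors' implementation): compute Integrated Gradients attributions per example, then average them over all examples of a class to obtain a class-level description.

```python
# Integrated Gradients with per-class aggregation on a toy model: attribute each
# input feature per example, then average attributions within each class.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 4))   # 4 classes, toy features

def integrated_gradients(model, x, target, steps=64):
    baseline = torch.zeros_like(x)
    # Interpolate between baseline and input, accumulate gradients along the path.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    points = baseline + alphas * (x - baseline)        # (steps, n_features)
    points.requires_grad_(True)
    scores = model(points)[:, target].sum()
    grads, = torch.autograd.grad(scores, points)
    return (x - baseline) * grads.mean(dim=0)          # Riemann approximation of IG

# Aggregate example-level attributions into class-level descriptions.
X = torch.randn(100, 16)
labels = model(X).argmax(dim=1)
class_explanations = {
    c: torch.stack([integrated_gradients(model, x, c) for x in X[labels == c]]).mean(dim=0)
    for c in labels.unique().tolist()
}
print({c: v[:4] for c, v in class_explanations.items()})  # start of each class profile
```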
no code implementations • 17 Aug 2021 • Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón, Otto Tarkka
This document describes the annotation guidelines used to construct the Turku Paraphrase Corpus.
no code implementations • MoTra (NoDaLiDa) 2021 • Li-Hsin Chang, Sampo Pyysalo, Jenna Kanerva, Filip Ginter
In this paper, we present a quantitative evaluation of differences between alternative translations in a large, recently released Finnish paraphrase corpus, focusing in particular on non-trivial variation in translation.
no code implementations • 23 Apr 2021 • Li-Hsin Chang, Iiro Rastas, Sampo Pyysalo, Filip Ginter
Essays as a form of assessment test student knowledge on a deeper level than short answer and multiple-choice questions.
1 code implementation • NoDaLiDa 2021 • Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Jenna Saarni, Maija Sevón, Otto Tarkka
Out of all paraphrase pairs in our corpus, 98% are manually classified as paraphrases at least in their given context, if not in all contexts.
no code implementations • 22 Oct 2020 • Li-Hsin Chang, Sampo Pyysalo, Jenna Kanerva, Filip Ginter
Language models based on deep neural networks have facilitated great advances in natural language processing and understanding tasks in recent years.
no code implementations • WS 2020 • Jenna Kanerva, Filip Ginter, Sampo Pyysalo
We present the approach of the TurkuNLP group to the IWPT 2020 shared task on Multilingual Parsing into Enhanced Universal Dependencies.
no code implementations • NoDaLiDa 2021 • Sampo Pyysalo, Jenna Kanerva, Antti Virtanen, Filip Ginter
In this paper, we introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data, along with 42 new such models, most of them for languages that have so far lacked dedicated deep neural language models.
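One central step of such a pipeline, sketched here under the assumption of a pre-extracted plain-text Wikipedia dump (the file name and vocabulary size are placeholders, not the authors' scripts), is training a language-specific WordPiece vocabulary:

```python
# Train a language-specific WordPiece vocabulary from extracted Wikipedia text;
# the resulting vocab.txt can then feed standard BERT pretraining tooling.
import os
from tokenizers import BertWordPieceTokenizer

os.makedirs("bert-fi-tokenizer", exist_ok=True)
# wiki_fi.txt is a hypothetical plain-text dump, one paragraph per line.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["wiki_fi.txt"], vocab_size=32000, min_frequency=2)
tokenizer.save_model("bert-fi-tokenizer")
print(tokenizer.encode("Tämä on esimerkkilause.").tokens)
```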
no code implementations • LREC 2020 • J{\"o}rg Tiedemann, Tommi Nieminen, Mikko Aulamo, Jenna Kanerva, Akseli Leino, Filip Ginter, Niko Papula
This paper presents FISKM{\"O}, a project that focuses on the development of resources and tools for cross-linguistic research and machine translation between Finnish and Swedish.
no code implementations • LREC 2020 • Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman
Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework.
1 code implementation • 15 Dec 2019 • Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, Sampo Pyysalo
Deep learning-based language models pretrained on large unannotated text corpora have been demonstrated to allow efficient transfer learning for natural language processing, with recent approaches such as the transformer-based BERT model advancing the state of the art across a variety of tasks.
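A minimal usage sketch follows; the model identifier below is believed to correspond to the released Finnish BERT, but should be verified against the paper's repository.

```python
# Load the Finnish BERT from the Hugging Face hub and encode one sentence.
# The model identifier is an assumption to be checked against the release.
from transformers import AutoModel, AutoTokenizer

name = "TurkuNLP/bert-base-finnish-cased-v1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("Turun yliopisto sijaitsee Suomessa.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, sequence_length, 768)
```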
1 code implementation • 2 Dec 2019 • Nelda Kote, Marenglen Biba, Jenna Kanerva, Samuel Rönnqvist, Filip Ginter
In this paper, we present the first publicly available part-of-speech and morphologically tagged corpus for the Albanian language, as well as a neural morphological tagger and lemmatizer trained on it.
no code implementations • WS 2019 • Thang Minh Ngo, Jenna Kanerva, Filip Ginter, Sampo Pyysalo
We present the approach taken by the TurkuNLP group in the CRAFT Structural Annotation task, a shared task on dependency parsing.
1 code implementation • WS 2019 • Samuel Rönnqvist, Jenna Kanerva, Tapio Salakoski, Filip Ginter
The multilingual BERT model is trained on 104 languages and meant to serve as a universal language model and tool for encoding sentences.
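A short sketch of using the public multilingual BERT checkpoint as a shared sentence encoder across languages; the mean-pooling over token embeddings is a simplification chosen for the example, not part of the paper.

```python
# Encode the same sentence in three languages with multilingual BERT and compare
# two of the resulting sentence vectors (mean-pooled over real tokens).
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sentences = ["The cat sleeps on the sofa.",      # English
             "Kissa nukkuu sohvalla.",           # Finnish
             "Katten sover på soffan."]          # Swedish
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (3, seq_len, 768)
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(1) / mask.sum(1)   # mean over non-padding tokens
print(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0))
```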
1 code implementation • WS (NoDaLiDa) 2019 • Jenna Kanerva, Samuel Rönnqvist, Riina Kekki, Tapio Salakoski, Filip Ginter
News articles such as sports game reports are often thought to closely follow the underlying game statistics, but in practice they contain a notable amount of background knowledge, interpretation, insight into the game, and quotes that are not present in the official statistics.
no code implementations • 26 Jun 2019 • Kai Hakala, Aleksi Vesanto, Niko Miekka, Tapio Salakoski, Filip Ginter
A common approach for improving OCR quality is a post-processing step based on models correcting misdetected characters and tokens.
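As a deliberately simplified stand-in for such learned correction models, a lexicon-plus-edit-distance corrector illustrates the post-processing idea (the lexicon and example sentence are toy placeholders):

```python
# Toy illustration of token-level OCR post-correction: map each out-of-lexicon
# token to its closest lexicon entry; leave unmatched tokens untouched.
import difflib

LEXICON = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def correct_token(token, lexicon=LEXICON, cutoff=0.6):
    if token.lower() in lexicon:
        return token
    candidates = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return candidates[0] if candidates else token

ocr_line = "tbe qu1ck brovvn fox jurnps over the 1azy dog"
print(" ".join(correct_token(t) for t in ocr_line.split()))
```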
no code implementations • 3 Feb 2019 • Jenna Kanerva, Filip Ginter, Tapio Salakoski
We evaluate our lemmatizer on 52 different languages and 76 different treebanks, showing that our system outperforms all of the latest baseline systems.
no code implementations • WS 2018 • Kira Droganova, Filip Ginter, Jenna Kanerva, Daniel Zeman
In this paper, we focus on parsing rare and non-trivial constructions, in particular ellipsis.
no code implementations • WS 2018 • Joakim Nivre, Paola Marongiu, Filip Ginter, Jenna Kanerva, Simonetta Montemagni, Sebastian Schuster, Maria Simi
We evaluate two cross-lingual techniques for adding enhanced dependencies to existing treebanks in Universal Dependencies.
no code implementations • CONLL 2018 • Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, Slav Petrov
Every year, the Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets.
no code implementations • CONLL 2018 • Jenna Kanerva, Filip Ginter, Niko Miekka, Akseli Leino, Tapio Salakoski
In this paper we describe the TurkuNLP entry at the CoNLL 2018 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies.
Ranked #5 on Dependency Parsing on Universal Dependencies
no code implementations • WS 2018 • Hans Moen, Kai Hakala, Laura-Maria Peltonen, Henry Suhonen, Petri Loukasmäki, Tapio Salakoski, Filip Ginter, Sanna Salanterä
Our aim is to allow nurses to write in a narrative manner without having to plan and structure the text with respect to sections and subject headings; instead, the system should assist with the assignment of subject headings and with restructuring afterwards.
no code implementations • WS 2017 • Juhani Luotolahti, Jenna Kanerva, Filip Ginter
In this paper we present our system in the DiscoMT 2017 Shared Task on Crosslingual Pronoun Prediction.
no code implementations • CONLL 2017 • Jenna Kanerva, Juhani Luotolahti, Filip Ginter
We present the TurkuNLP entry in the CoNLL 2017 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies.
no code implementations • CONLL 2017 • Daniel Zeman, Martin Popel, Milan Straka, Jan Hajič, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinková, Jan Hajič jr., Jaroslava Hlaváčová, Václava Kettnerová, Zdeňka Urešová, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria de Paiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonça, Tatiana Lando, Rattima Nitisaroj, Josie Li
The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets.
1 code implementation • WS 2017 • Farrokh Mehryary, Kai Hakala, Suwisa Kaewphan, Jari Björne, Tapio Salakoski, Filip Ginter
The official evaluation shows that the joint performance of our entity detection and relation extraction models outperforms the winning team of the Shared Task by 19 percentage points in F1 score, establishing a new top score for the task.
no code implementations • WS 2017 • Hans Moen, Kai Hakala, Farrokh Mehryary, Laura-Maria Peltonen, Tapio Salakoski, Filip Ginter, Sanna Salanterä
We study and compare two different approaches to the task of automatic assignment of predefined classes to clinical free-text narratives.
no code implementations • CL (ACL) 2021 • Joakim Nivre, Daniel Zeman, Filip Ginter, Francis Tyers
Universal Dependencies (UD) is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages.
no code implementations • LREC 2016 • Mojgan Seraji, Filip Ginter, Joakim Nivre
The Persian Universal Dependency Treebank (Persian UD) is a recent effort of treebanking Persian with Universal Dependencies (UD), an ongoing project that designs unified and cross-linguistically valid grammatical representations including part-of-speech tags, morphological features, and dependency relations.
no code implementations • LREC 2016 • Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan Mcdonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, Daniel Zeman
Cross-linguistically consistent annotation is necessary for sound comparative evaluation and cross-lingual learning experiments.
1 code implementation • 3 Jan 2016 • Yuxiang Jiang, Tal Ronnen Oron, Wyatt T Clark, Asma R Bankapur, Daniel D'Andrea, Rosalba Lepore, Christopher S Funk, Indika Kahanda, Karin M Verspoor, Asa Ben-Hur, Emily Koo, Duncan Penfold-Brown, Dennis Shasha, Noah Youngs, Richard Bonneau, Alexandra Lin, Sayed ME Sahraeian, Pier Luigi Martelli, Giuseppe Profiti, Rita Casadio, Renzhi Cao, Zhaolong Zhong, Jianlin Cheng, Adrian Altenhoff, Nives Skunca, Christophe Dessimoz, Tunca Dogan, Kai Hakala, Suwisa Kaewphan, Farrokh Mehryary, Tapio Salakoski, Filip Ginter, Hai Fang, Ben Smithers, Matt Oates, Julian Gough, Petri Törönen, Patrik Koskinen, Liisa Holm, Ching-Tai Chen, Wen-Lian Hsu, Kevin Bryson, Domenico Cozzetto, Federico Minneci, David T Jones, Samuel Chapman, Dukka B K. C., Ishita K Khan, Daisuke Kihara, Dan Ofer, Nadav Rappoport, Amos Stern, Elena Cibrian-Uhalte, Paul Denny, Rebecca E Foulger, Reija Hieta, Duncan Legge, Ruth C Lovering, Michele Magrane, Anna N Melidoni, Prudence Mutowo-Meullenet, Klemens Pichler, Aleksandra Shypitsyna, Biao Li, Pooya Zakeri, Sarah ElShal, Léon-Charles Tranchevent, Sayoni Das, Natalie L Dawson, David Lee, Jonathan G Lees, Ian Sillitoe, Prajwal Bhat, Tamás Nepusz, Alfonso E Romero, Rajkumar Sasidharan, Haixuan Yang, Alberto Paccanaro, Jesse Gillis, Adriana E Sedeño-Cortés, Paul Pavlidis, Shou Feng, Juan M Cejuela, Tatyana Goldberg, Tobias Hamp, Lothar Richter, Asaf Salamov, Toni Gabaldon, Marina Marcet-Houben, Fran Supek, Qingtian Gong, Wei Ning, Yuanpeng Zhou, Weidong Tian, Marco Falda, Paolo Fontana, Enrico Lavezzo, Stefano Toppo, Carlo Ferrari, Manuel Giollo, Damiano Piovesan, Silvio Tosatto, Angela del Pozo, José M Fernández, Paolo Maietta, Alfonso Valencia, Michael L Tress, Alfredo Benso, Stefano Di Carlo, Gianfranco Politano, Alessandro Savino, Hafeez Ur Rehman, Matteo Re, Marco Mesiti, Giorgio Valentini, Joachim W Bargsten, Aalt DJ van Dijk, Branislava Gemovic, Sanja Glisic, Vladmir Perovic, Veljko Veljkovic, Nevena Veljkovic, Danillo C Almeida-e-Silva, Ricardo ZN Vencio, Malvika Sharan, Jörg Vogel, Lakesh Kansakar, Shanshan Zhang, Slobodan Vucetic, Zheng Wang, Michael JE Sternberg, Mark N Wass, Rachael P Huntley, Maria J Martin, Claire O'Donovan, Peter N. Robinson, Yves Moreau, Anna Tramontano, Patricia C Babbitt, Steven E Brenner, Michal Linial, Christine A Orengo, Burkhard Rost, Casey S Greene, Sean D Mooney, Iddo Friedberg, Predrag Radivojac
To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2.
Quantitative Methods
no code implementations • LREC 2014 • Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, Christopher D. Manning
Revisiting the now de facto standard Stanford dependency representation, we propose an improved taxonomy to capture grammatical relations across languages, including morphologically rich ones.
no code implementations • TACL 2013 • Bernd Bohnet, Joakim Nivre, Igor Boguslavsky, Richárd Farkas, Filip Ginter, Jan Hajič
Joint morphological and syntactic analysis has been proposed as a way of improving parsing accuracy for richly inflected languages.