Search Results for author: Sampo Pyysalo

Found 51 papers, 12 papers with code

Towards better structured and less noisy Web data: Oscar with Register annotations

no code implementations • COLING (WNUT) 2022 • Veronika Laippala, Anna Salmela, Samuel Rönnqvist, Alham Fikri Aji, Li-Hsin Chang, Asma Dhifallah, Larissa Goulart, Henna Kortelainen, Marc Pàmies, Deise Prina Dutra, Valtteri Skantsi, Lintang Sutawika, Sampo Pyysalo

Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content as well as noise originating from the crawling process.

Paper
Add Code

Toward Multilingual Identification of Online Registers

no code implementations • WS (NoDaLiDa) 2019 • Veronika Laippala, Roosa Kyllönen, Jesse Egbert, Douglas Biber, Sampo Pyysalo

We consider cross- and multilingual text classification approaches to the identification of online registers (genres), i. e. text varieties with specific situational characteristics.

Multilingual text classification Multilingual Word Embeddings +2

Paper
Add Code

Fine-grained Named Entity Annotation for Finnish

no code implementations • NoDaLiDa 2021 • Jouni Luoma, Li-Hsin Chang, Filip Ginter, Sampo Pyysalo

We introduce a corpus with fine-grained named entity annotation for Finnish, following the OntoNotes guidelines to create a resource that is cross-lingually compatible with existing annotations for other languages.

NER

Paper
Add Code

Poro 34B and the Blessing of Multilinguality

no code implementations • 2 Apr 2024 • Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo

The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages.

Paper
Add Code

Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

no code implementations • 30 Mar 2024 • Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Nicolo Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Roberto Navigli, Virendra Mehta, Matthew Blumberg, Victor May, Huu Nguyen, Sampo Pyysalo

Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility.

Continual Pretraining Language Modelling

Paper
Add Code

A New Massive Multilingual Dataset for High-Performance Language Technologies

no code implementations • 20 Mar 2024 • Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer Van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive.

Language Modelling Machine Translation +2

Paper
Add Code

FinGPT: Large Generative Models for a Small Language

no code implementations • 3 Nov 2023 • Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Le Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, Sampo Pyysalo

We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.

Paper
Add Code

Scaling Data-Constrained Language Models

1 code implementation • NeurIPS 2023 • Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel

We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.

284

Paper
Code

Multi-CrossRE A Multi-Lingual Multi-Domain Dataset for Relation Extraction

1 code implementation • 18 May 2023 • Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank

Most research in Relation Extraction (RE) involves the English language, mainly due to the lack of multi-lingual resources.

Relation Relation Extraction +1

Paper
Code

Silver Syntax Pre-training for Cross-Domain Relation Extraction

1 code implementation • 18 May 2023 • Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank

One of the main reasons for this is the limited training size of current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain.

Relation Relation Extraction

Paper
Code

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

6 code implementations • 9 Nov 2022 • BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, Loubna Ben allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Davut Emre Taşar, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Daniel McDuff, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel De Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, Thomas Wolf

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions.

Language Modelling Multilingual NLP

2,181

Paper
Code

Explaining Classes through Word Attribution

no code implementations • 31 Aug 2021 • Samuel Rönnqvist, Amanda Myntti, Aki-Juhani Kyröläinen, Sampo Pyysalo, Veronika Laippala, Filip Ginter

In this work, we propose a method for explaining classes using deep learning models and the Integrated Gradients feature attribution technique by aggregating explanations of individual examples in text classification to general descriptions of the classes.

Genre classification text-classification +1

Paper
Add Code

Quantitative Evaluation of Alternative Translations in a Corpus of Highly Dissimilar Finnish Paraphrases

no code implementations • MoTra (NoDaLiDa) 2021 • Li-Hsin Chang, Sampo Pyysalo, Jenna Kanerva, Filip Ginter

In this paper, we present a quantitative evaluation of differences between alternative translations in a large recently released Finnish paraphrase corpus focusing in particular on non-trivial variation in translation.

Translation

Paper
Add Code

Deep learning for sentence clustering in essay grading support

no code implementations • 23 Apr 2021 • Li-Hsin Chang, Iiro Rastas, Sampo Pyysalo, Filip Ginter

Essays as a form of assessment test student knowledge on a deeper level than short answer and multiple-choice questions.

Clustering Multiple-choice +1

Paper
Add Code

Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers

1 code implementation • EACL 2021 • Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo, Veronika Laippala

We explore cross-lingual transfer of register classification for web documents.

Classification General Classification +1

Paper
Code

Towards Fully Bilingual Deep Language Modeling

no code implementations • 22 Oct 2020 • Li-Hsin Chang, Sampo Pyysalo, Jenna Kanerva, Filip Ginter

Language models based on deep neural networks have facilitated great advances in natural language processing and understanding tasks in recent years.

Cross-Lingual Transfer Language Modelling

Paper
Add Code

The birth of Romanian BERT

1 code implementation • Findings of the Association for Computational Linguistics 2020 • Stefan Daniel Dumitrescu, Andrei-Marius Avram, Sampo Pyysalo

Large-scale pretrained language models have become ubiquitous in Natural Language Processing.

Language Modelling

Paper
Code

Turku Enhanced Parser Pipeline: From Raw Text to Enhanced Graphs in the IWPT 2020 Shared Task

no code implementations • WS 2020 • Jenna Kanerva, Filip Ginter, Sampo Pyysalo

We present the approach of the TurkuNLP group to the IWPT 2020 shared task on Multilingual Parsing into Enhanced Universal Dependencies.

Lemmatization

Paper
Add Code

WikiBERT models: deep transfer learning for many languages

no code implementations • NoDaLiDa 2021 • Sampo Pyysalo, Jenna Kanerva, Antti Virtanen, Filip Ginter

In this paper, we introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data and introduce 42 new such models, most for languages up to now lacking dedicated deep neural language models.

Transfer Learning

Paper
Add Code

Exploring Cross-sentence Contexts for Named Entity Recognition with BERT

1 code implementation • COLING 2020 • Jouni Luoma, Sampo Pyysalo

We find that adding context in the form of additional sentences to BERT input systematically increases NER performance on all of the tested languages and models.

Ranked #3 on Named Entity Recognition (NER) on CoNLL 2003 (German)

named-entity-recognition Named Entity Recognition +2

Paper
Code

A Broad-coverage Corpus for Finnish Named Entity Recognition

no code implementations • LREC 2020 • Jouni Luoma, Miika Oinonen, Maria Pyyk{\"o}nen, Veronika Laippala, Sampo Pyysalo

We present a new manually annotated corpus for broad-coverage named entity recognition for Finnish.

named-entity-recognition Named Entity Recognition +1

Paper
Add Code

From Web Crawl to Clean Register-Annotated Corpora

no code implementations • LREC 2020 • Veronika Laippala, Samuel R{\"o}nnqvist, Saara Hellstr{\"o}m, Juhani Luotolahti, Liina Repo, Anna Salmela, Valtteri Skantsi, Sampo Pyysalo

However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents.

Paper
Add Code

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

no code implementations • LREC 2020 • Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework.

Paper
Add Code

Multilingual is not enough: BERT for Finnish

1 code implementation • 15 Dec 2019 • Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, Sampo Pyysalo

Deep learning-based language models pretrained on large unannotated text corpora have been demonstrated to allow efficient transfer learning for natural language processing, with recent approaches such as the transformer-based BERT model advancing the state of the art across a variety of tasks.

Dependency Parsing named-entity-recognition +4

Paper
Code

Biomedical Named Entity Recognition with Multilingual BERT

1 code implementation • WS 2019 • Kai Hakala, Sampo Pyysalo

We present the approach of the Turku NLP group to the PharmaCoNER task on Spanish biomedical named entity recognition.

named-entity-recognition Named Entity Recognition +1

Paper
Code

CRAFT Shared Tasks 2019 Overview --- Integrated Structure, Semantics, and Coreference

no code implementations • WS 2019 • William Baumgartner, Michael Bada, Sampo Pyysalo, Manuel R. Ciosici, Negacy Hailu, Harrison Pielke-Lombardo, Michael Regan, Lawrence Hunter

As part of the BioNLP Open Shared Tasks 2019, the CRAFT Shared Tasks 2019 provides a platform to gauge the state of the art for three fundamental language processing tasks {---} dependency parse construction, coreference resolution, and ontology concept identification {---} over full-text biomedical articles.

coreference-resolution Dependency Parsing +2

Paper
Add Code

Neural Dependency Parsing of Biomedical Text: TurkuNLP entry in the CRAFT Structural Annotation Task

no code implementations • WS 2019 • Thang Minh Ngo, Jenna Kanerva, Filip Ginter, Sampo Pyysalo

We present the approach taken by the TurkuNLP group in the CRAFT Structural Annotation task, a shared task on dependency parsing.

Dependency Parsing named-entity-recognition +3

Paper
Add Code

Fully Delexicalized Contexts for Syntax-Based Word Embeddings

no code implementations • WS 2017 • Jenna Kanerva, Sampo Pyysalo, Filip Ginter

Word Embeddings

Paper
Add Code

A neural network multi-task learning approach to biomedical named entity recognition

1 code implementation • BMC Bioinformatics 2017 • Gamal Crichton, Sampo Pyysalo, Billy Chiu, Anna Korhonen

Additionally, we investigated the effect of dataset size on performance in both single- and multi-task settings.

Anatomy Multi-Task Learning +3

210

Paper
Code

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

no code implementations • CONLL 2017 • Daniel Zeman, Martin Popel, Milan Straka, Jan Haji{\v{c}}, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinkov{\'a}, Jan Haji{\v{c}} jr., Jaroslava Hlav{\'a}{\v{c}}ov{\'a}, V{\'a}clava Kettnerov{\'a}, Zde{\v{n}}ka Ure{\v{s}}ov{\'a}, Jenna Kanerva, Stina Ojala, Anna Missil{\"a}, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria de Paiva, Kira Droganova, H{\'e}ctor Mart{\'\i}nez Alonso, {\c{C}}a{\u{g}}r{\i} {\c{C}}{\"o}ltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, M, Michael l, Jesse Kirchner, Hector Fern Alcalde, ez, Jana Strnadov{\'a}, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendon{\c{c}}a, L, Tatiana o, Rattima Nitisaroj, Josie Li

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets.

Dependency Parsing

Paper
Add Code

Cancer Hallmark Text Classification Using Convolutional Neural Networks

no code implementations • WS 2016 • Simon Baker, Anna Korhonen, Sampo Pyysalo

Methods based on deep learning approaches have recently achieved state-of-the-art performance in a range of machine learning tasks and are increasingly applied to natural language processing (NLP).

General Classification text-classification +1

Paper
Add Code

Attending to Characters in Neural Sequence Labeling Models

no code implementations • COLING 2016 • Marek Rei, Gamal K. O. Crichton, Sampo Pyysalo

Sequence labeling architectures use word embeddings for capturing similarity, but suffer when handling previously unseen or rare words.

Ranked #7 on Grammatical Error Detection on FCE

Chunking Grammatical Error Detection +3

Paper
Add Code

Deep Learning with Minimal Training Data: TurkuNLP Entry in the BioNLP Shared Task 2016

no code implementations • WS 2016 • Farrokh Mehryary, Jari Bj{\"o}rne, Sampo Pyysalo, Tapio Salakoski, Filip Ginter

Named Entity Recognition (NER) Relation Extraction

Paper
Add Code

How to Train good Word Embeddings for Biomedical NLP

1 code implementation • WS 2016 • Billy Chiu, Gamal Crichton, Anna Korhonen, Sampo Pyysalo

Learning Word Embeddings Named Entity Recognition (NER) +1

119

Paper
Code

Intrinsic Evaluation of Word Vectors Fails to Predict Extrinsic Performance

no code implementations • WS 2016 • Billy Chiu, Anna Korhonen, Sampo Pyysalo

Named Entity Recognition (NER) Part-Of-Speech Tagging +1

Paper
Add Code

Universal Dependencies v1: A Multilingual Treebank Collection

no code implementations • LREC 2016 • Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Haji{\v{c}}, Christopher D. Manning, Ryan Mcdonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, Daniel Zeman

Cross-linguistically consistent annotation is necessary for sound comparative evaluation and cross-lingual learning experiments.

Paper
Add Code

Typed Entity and Relation Annotation on Computer Science Papers

1 code implementation • LREC 2016 • Yuka Tateisi, Tomoko Ohta, Sampo Pyysalo, Yusuke Miyao, Akiko Aizawa

In our scheme, mentions of entities are annotated with ontology-based types, and the roles of the entities are annotated as relations with other entities described in the text.

Relation