no code implementations • LREC 2022 • Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren
We present GPT-SW3, a 3.5 billion parameter autoregressive language model, trained on a newly created 100 GB Swedish corpus.
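As a rough illustration of how one might sample from an autoregressive model of this kind, the sketch below uses the Hugging Face transformers API; the checkpoint name is an illustrative assumption, not necessarily the identifier under which GPT-SW3 was released.

```python
# Minimal sketch: sampled generation from an autoregressive Swedish LM.
# The model id below is a placeholder assumption, for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AI-Sweden-Models/gpt-sw3-356m"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Stockholm är"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```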
no code implementations • WS (NoDaLiDa) 2019 • Magnus Sahlgren, Fredrik Olsson
This paper investigates the presence of gender bias in pretrained Swedish embeddings.
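One simple way to probe for this kind of bias (not necessarily the paper's own methodology) is to compare how close occupation words sit to gendered pronouns in the embedding space; the sketch below uses English GloVe vectors via gensim purely as a stand-in for the Swedish embeddings studied in the paper.

```python
# Minimal sketch of a simple bias probe: measure whether occupation words
# are closer to "he" than to "she" in a pretrained embedding space.
# English GloVe is used as a stand-in for the Swedish embeddings.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads on first use

for occupation in ["nurse", "engineer", "teacher", "carpenter"]:
    gap = vectors.similarity(occupation, "he") - vectors.similarity(occupation, "she")
    print(f"{occupation:10s} he-vs-she similarity gap: {gap:+.3f}")
```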
1 code implementation • LREC 2022 • Fredrik Carlsson, Philipp Eisen, Faton Rekathati, Magnus Sahlgren
The long-standing endeavor of relating the textual and the visual domain recently underwent a pivotal breakthrough, as OpenAI released CLIP.
Ranked #4 on Zero-shot Image Retrieval on XTD10
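Zero-shot retrieval with a CLIP-style model amounts to scoring texts and images by similarity in a shared embedding space. The sketch below illustrates this with OpenAI's public English CLIP checkpoint as a stand-in (the paper's Swedish CLIP checkpoint name is not given here).

```python
# Minimal sketch of CLIP-style zero-shot retrieval: score a set of captions
# against one image via the shared text-image embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image
captions = ["a dog on a beach", "a city skyline at night", "a bowl of soup"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores (one row per image).
scores = outputs.logits_per_image.softmax(dim=-1)
best = scores.argmax(dim=-1).item()
print(f"best caption: {captions[best]!r}")
```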
no code implementations • EMNLP (MRQA) 2021 • Fredrik Carlsson, Magnus Sahlgren, Fredrik Olsson, Amaru Cuba Gyllensten
This paper introduces a long-range multiple-choice Question Answering (QA) dataset, based on full-length fiction book texts.
1 code implementation • ACL 2022 • Fredrik Carlsson, Joey Öhman, Fangyu Liu, Severine Verlinden, Joakim Nivre, Magnus Sahlgren
We propose a resource-efficient method for converting a pre-trained CLM into this architecture, and demonstrate its potential on various experiments, including the novel task of contextualized word inclusion.
no code implementations • NoDaLiDa 2021 • Abdul Aziz Alkathiri, Lodovico Giaretta, Sarunas Girdzijauskas, Magnus Sahlgren
Advanced NLP models require huge amounts of data from various domains to produce high-quality representations.
no code implementations • NoDaLiDa 2021 • Magnus Sahlgren, Fredrik Carlsson, Fredrik Olsson, Love Börjeson
When is it beneficial for a research community to organize a broader collaborative effort on a topic, and when should we instead promote individual efforts?
no code implementations • 22 May 2023 • Ariel Ekgren, Amaru Cuba Gyllensten, Felix Stollenwerk, Joey Öhman, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Alice Heiman, Judit Casademont, Magnus Sahlgren
This paper details the process of developing the first native large generative language model for the Nordic languages, GPT-SW3.
no code implementations • 30 Mar 2023 • Joey Öhman, Severine Verlinden, Ariel Ekgren, Amaru Cuba Gyllensten, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Magnus Sahlgren
Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of the LLMs typically correlates with the scale and quality of the datasets.
2 code implementations • 11 Oct 2021 • Fredrik Olsson, Magnus Sahlgren
In this paper, we identify the state of data as being an important reason for failure in applied Natural Language Processing (NLP) projects.
no code implementations • LREC 2022 • Evangelia Gogoulou, Ariel Ekgren, Tim Isbister, Magnus Sahlgren
Additionally, the results of evaluating the transferred models in source language tasks reveal that their performance in the source domain deteriorates after transfer.
1 code implementation • 20 May 2021 • Alessandro Lenci, Magnus Sahlgren, Patrick Jeuniaux, Amaru Cuba Gyllensten, Martina Miliani
In this paper, we perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
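One way to obtain such BERT-based type vectors is to average the contextualized vectors of a word over the sentences in which it occurs. The sketch below shows this under simplifying assumptions (the `type_vector` helper and the subword-matching heuristic are illustrative, not the paper's exact procedure).

```python
# Minimal sketch: build a "type-level" vector for a word by averaging its
# contextualized BERT vectors over a handful of sentences that contain it.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def type_vector(word, sentences):
    vectors = []
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
        # Collect hidden states of subword tokens belonging to `word`.
        for i, tok in enumerate(enc["input_ids"][0].tolist()):
            if tok in word_ids:
                vectors.append(hidden[i])
    return torch.stack(vectors).mean(dim=0)

vec = type_vector("bank", ["She sat by the river bank.", "The bank raised its rates."])
print(vec.shape)
```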
1 code implementation • NoDaLiDa 2021 • Tim Isbister, Fredrik Carlsson, Magnus Sahlgren
We demonstrate empirically that a large English language model coupled with modern machine translation outperforms native language models in most Scandinavian languages.
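The basic pipeline is translate-then-classify: machine-translate the Scandinavian input into English, then apply an English model. The sketch below illustrates the idea; the model choices are illustrative assumptions, not those evaluated in the paper.

```python
# Minimal sketch of the translate-then-apply-an-English-model idea:
# translate Swedish input to English, then run an English classifier.
# Model names are illustrative choices, not those used in the paper.
from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-sv-en")
classify = pipeline("sentiment-analysis")  # default English sentiment model

swedish_text = "Filmen var helt fantastisk, jag älskade den."
english_text = translate(swedish_text)[0]["translation_text"]
print(english_text, classify(english_text))
```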
no code implementations • 19 Apr 2021 • Daniel Garcia Bernal, Lodovico Giaretta, Sarunas Girdzijauskas, Magnus Sahlgren
The results show that neither the quality of the results nor the convergence time in Federated Word2Vec deteriorates as compared to centralised Word2Vec.
no code implementations • EACL 2021 • Evangelia Gogoulou, Magnus Boman, Fehmi ben Abdesslem, Nils Hentati Isacsson, Viktor Kaldo, Magnus Sahlgren
We investigate the feasibility of applying standard text categorisation methods to patient text in order to predict treatment outcome in Internet-based cognitive behavioural therapy.
no code implementations • 8 Feb 2021 • Magnus Sahlgren, Fredrik Carlsson
By contrast, we will argue that there are many different types of language use, meaning and understanding, and that (current) language models are built with the explicit purpose of acquiring and representing one type of structural understanding of language.
1 code implementation • ICLR 2021 • Fredrik Carlsson, Amaru Cuba Gyllensten, Evangelia Gogoulou, Erik Ylipää Hellqvist, Magnus Sahlgren
Extracting semantically useful natural language sentence representations from pre-trained deep neural networks such as Transformers remains a challenge.
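A common baseline for this problem (not the method proposed in the paper) is to mean-pool the final-layer token vectors of a pre-trained Transformer into a fixed-size sentence representation, as in the sketch below.

```python
# Minimal sketch of a common baseline: mean-pool the final-layer token
# vectors of a pre-trained Transformer to get a sentence representation.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # (1, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)   # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

a, b = embed("A cat sits on the mat."), embed("A kitten rests on a rug.")
print(torch.cosine_similarity(a, b).item())
```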
no code implementations • SEMEVAL 2020 • Amaru Cuba Gyllensten, Evangelia Gogoulou, Ariel Ekgren, Magnus Sahlgren
We (Team Skurt) propose a simple method to detect lexical semantic change by clustering contextualized embeddings produced by XLM-R, using K-Means++.
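The sketch below illustrates the general recipe, under simplifying assumptions: embed each occurrence of a target word with XLM-R and cluster the occurrence vectors with k-means++ initialisation; comparing cluster distributions across time periods is then one way to flag semantic change. Sentence-level mean pooling is used here as a simplification.

```python
# Minimal sketch: cluster contextualized occurrences of a target word with
# k-means (k-means++ initialisation) as a signal of lexical semantic change.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def occurrence_vectors(sentences):
    """One mean-pooled vector per sentence containing the target word."""
    vecs = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        vecs.append(hidden.mean(dim=0).numpy())
    return np.stack(vecs)

sentences = ["The mouse ran under the table.", "Click the left mouse button.",
             "A field mouse nests in the barn.", "Move the mouse to the icon."]
X = occurrence_vectors(sentences)
labels = KMeans(n_clusters=2, init="k-means++", n_init=10).fit_predict(X)
print(labels)
```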
no code implementations • Findings of the Association for Computational Linguistics 2020 • Magnus Sahlgren
This paper problematizes the reliance on documents as the basic notion for defining term interactions in standard topic models.
1 code implementation • 7 Sep 2020 • Tim Isbister, Magnus Sahlgren
This paper presents the first Swedish evaluation benchmark for textual semantic similarity.
2 code implementations • 4 Sep 2020 • Fredrik Olsson, Magnus Sahlgren
This document concerns data readiness in the context of machine learning and Natural Language Processing.
no code implementations • LREC 2020 • Fredrik Olsson, Magnus Sahlgren, Fehmi ben Abdesslem, Ariel Ekgren, Kristine Eck
We cast the problem of event annotation as one of text categorization, and compare state-of-the-art text categorization techniques on event data produced within the Uppsala Conflict Data Program (UCDP).
no code implementations • WS 2018 • Amaru Cuba Gyllensten, Magnus Sahlgren
Sentiment and topic analysis are common methods used for social media monitoring.
no code implementations • WS 2018 • Magnus Sahlgren, Tim Isbister, Fredrik Olsson
This paper discusses whether it is possible to learn a generic representation that is useful for detecting various types of abusive language.
1 code implementation • WS 2019 • Ariel Ekgren, Amaru Cuba Gyllensten, Magnus Sahlgren
This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding techniques.
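For context, a single Byte Pair Encoding step merges the most frequent adjacent symbol pair in the corpus into a new symbol; repeating this yields a data-driven subword vocabulary. The sketch below is a toy illustration of that merge loop, not the paper's implementation.

```python
# Minimal sketch of Byte Pair Encoding merges: repeatedly find the most
# frequent adjacent symbol pair and merge it into a single symbol.
from collections import Counter

def most_frequent_pair(corpus):
    """corpus: list of words, each a list of symbols."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge(corpus, pair):
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("newer"), list("wider")]
for _ in range(3):  # perform three merges
    corpus = merge(corpus, most_frequent_pair(corpus))
print(corpus)
```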
no code implementations • 13 Mar 2018 • Tim Isbister, Magnus Sahlgren, Lisa Kaati, Milan Obaidi, Nazar Akrami
Hateful comments, swearwords and sometimes even death threats are becoming a reality for many people today in online environments.
no code implementations • LREC 2018 • Amaru Cuba Gyllensten, Magnus Sahlgren
This paper is a short empirical study of the performance of centrality and classification based iterative term set expansion methods for distributional semantic models.
no code implementations • WS 2016 • Maria Skeppstedt, Magnus Sahlgren, Carita Paradis, Andreas Kerren
This larger variation was also reflected in the lower recall achieved by the lexicon-based approach for sentiment than for the categories speculation, contrast, and condition.
no code implementations • EMNLP 2016 • Magnus Sahlgren, Alessandro Lenci
This paper investigates the effects of data size and frequency range on distributional semantic models.
no code implementations • LREC 2016 • Magnus Sahlgren, Amaru Cuba Gyllensten, Fredrik Espinoza, Ola Hamfors, Jussi Karlgren, Fredrik Olsson, Per Persson, Akshay Viswanathan, Anders Holst
This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages.
no code implementations • EMNLP 2015 • Amaru Cuba Gyllensten, Magnus Sahlgren
We also argue that the topology of the neighborhoods in semantic space can be used to determine the semantic horizon of a point, which we define as the set of neighbors that have a direct connection to the point.