no code implementations • LREC 2022 • Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren
We present GPT-SW3, a 3.5 billion parameter autoregressive language model, trained on a newly created 100 GB Swedish corpus.
no code implementations • 22 May 2023 • Ariel Ekgren, Amaru Cuba Gyllensten, Felix Stollenwerk, Joey Öhman, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Alice Heiman, Judit Casademont, Magnus Sahlgren
This paper details the process of developing the first native large generative language model for the Nordic languages, GPT-SW3.
no code implementations • 30 Mar 2023 • Joey Öhman, Severine Verlinden, Ariel Ekgren, Amaru Cuba Gyllensten, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Magnus Sahlgren
Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of the LLMs typically correlates with the scale and quality of the datasets.
no code implementations • LREC 2022 • Evangelia Gogoulou, Ariel Ekgren, Tim Isbister, Magnus Sahlgren
Additionally, the results of evaluating the transferred models in source language tasks reveal that their performance in the source domain deteriorates after transfer.
no code implementations • SEMEVAL 2020 • Amaru Cuba Gyllensten, Evangelia Gogoulou, Ariel Ekgren, Magnus Sahlgren
We (Team Skurt) propose a simple method to detect lexical semantic change by clustering contextualized embeddings produced by XLM-R, using K-Means++.
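The clustering step described above can be sketched in a few lines. This is not the authors' code: the paper clusters contextualized XLM-R embeddings, while this illustrative stand-in (pure standard-library Python, plain numeric vectors, hypothetical function name `kmeans_pp`) only shows the k-means++ seeding plus Lloyd iterations that K-Means++ refers to.

```python
import random


def kmeans_pp(points, k, iters=10, seed=0):
    """Cluster vectors with k-means, using k-means++ seeding.

    Illustrative sketch: in the paper the inputs would be contextualized
    XLM-R embeddings; here `points` are plain tuples of floats.
    """
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # k-means++ seeding: first centre uniform at random, later centres
    # sampled proportionally to squared distance from the nearest centre.
    centres = [list(rng.choice(points))]
    while len(centres) < k:
        d2 = [min(dist2(p, c) for c in centres) for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centres.append(list(p))
                break

    # Standard Lloyd iterations: assign, then recompute centres.
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centres[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = [sum(xs) / len(members)
                              for xs in zip(*members)]
    return labels, centres
```

For semantic-change detection, each occurrence of a target word would be embedded in context and clustered this way; a shift in cluster proportions between time periods then signals lexical change.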
no code implementations • LREC 2020 • Fredrik Olsson, Magnus Sahlgren, Fehmi ben Abdesslem, Ariel Ekgren, Kristine Eck
We cast the problem of event annotation as one of text categorization, and compare state-of-the-art text categorization techniques on event data produced within the Uppsala Conflict Data Program (UCDP).
1 code implementation • WS 2019 • Ariel Ekgren, Amaru Cuba Gyllensten, Magnus Sahlgren
This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding techniques.
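The core of Byte Pair Encoding is easy to state: starting from character sequences, repeatedly merge the most frequent adjacent symbol pair into a new symbol. The sketch below (a standard textbook-style formulation, not the paper's implementation; the function name `learn_bpe` is invented for illustration) shows that merge loop on a toy word-frequency dictionary.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict.

    Illustrative sketch: each word starts as a tuple of characters,
    and each round merges the most frequent adjacent symbol pair
    across the (frequency-weighted) vocabulary.
    """
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge left-to-right in every word.
        merged_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] = count
        vocab = merged_vocab
    return merges
```

Re-Pair follows the same pair-merging idea but, as a grammar-based compression scheme, records each merge as a grammar rule and requires every replaced pair to occur more than once.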