The influence of fake news on the perception of reality has become a mainstream topic in recent years due to the fast propagation of misleading information.
Natural Language Understanding (NLU) technology has improved significantly over the last few years, and multitask benchmarks such as GLUE are key to evaluating this improvement in a robust and general way.
In this paper, we introduce the first SemEval shared task on Structured Sentiment Analysis, for which participants are required to predict all sentiment graphs in a text, where a single sentiment graph is composed of a sentiment holder, target, expression and polarity.
Comprehensive experimentation with language models for Spanish shows that sometimes multilingual models fare better than monolingual ones, even outperforming models which have been adapted to the medical domain.
Extensive experimentation on a newly collected and annotated multilingual (French, English, and Spanish) dataset composed of tourism-related tweets shows that current few-shot learning techniques allow us to obtain competitive results for all three tasks with very little annotation data: 5 tweets per label (15 in total) for Sentiment Analysis, 10% of the tweets for location detection (around 160) and 13% (200 approx.)
Previous attempts to leverage such information have failed, even with the largest models, as they are not able to follow the guidelines out-of-the-box.
Ranked #1 on Zero-shot Named Entity Recognition (NER) on HarveyNER (using extra training data)
no code implementations • 9 Jun 2023 • Rodrigo Agerri, Iñigo Alonso, Aitziber Atutxa, Ander Berrondo, Ainara Estarrona, Iker Garcia-Ferrero, Iakes Goenaga, Koldo Gojenola, Maite Oronoz, Igor Perez-Tejedor, German Rigau, Anar Yeginbergenova
Providing high quality explanations for AI predictions based on machine learning is a challenging and complex task.
Detecting and normalizing temporal expressions is an essential step for many NLP tasks.
Given that the process to obtain a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without considering whether that is the optimum in terms of downstream performance.
In the absence of readily available labeled data for a given sequence labeling task and language, annotation projection has been proposed as one of the possible strategies to automatically generate annotated data.
Ranked #1 on Cross-Lingual NER on MasakhaNER2.0 (Hausa metric)
Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released.
Zero-resource cross-lingual transfer approaches aim to apply supervised models from a source language to unlabelled target languages.
Ranked #1 on Cross-Lingual NER on CoNLL 2003
The lack of wide coverage datasets annotated with everyday metaphorical expressions for languages other than English is striking.
The large majority of the research performed on stance detection has been focused on developing more or less sophisticated text classification systems, even when many benchmarks are based on social network data such as Twitter.
Parliamentary transcripts provide a valuable resource for understanding the reality of our societies and the most important events that occur in them over time.
For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100.
The growing interest in employing counter narratives for hatred intervention brings with it a focus on dataset creation and automation strategies.
While interactions in social media such as Twitter occur in many natural languages, research on stance detection (the position or attitude expressed with respect to a specific topic) within the Natural Language Processing field has largely been done for English.
The TW-10 Referendum Dataset released at IberEval 2018 is a previous effort to provide multilingual stance-annotated data in Catalan and Spanish.
This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora.
This paper presents a new technique for creating monolingual and cross-lingual meta-embeddings.
In this paper we describe our participation in the Hyperpartisan News Detection shared task at SemEval 2019.
In this research note we present a language-independent system to model Opinion Target Extraction (OTE) as a sequence labelling task.
This paper presents a simple, robust and (almost) unsupervised dictionary-based method, qwn-ppv (Q-WordNet as Personalized PageRanking Vector) to automatically generate polarity lexicons.
In this paper we present an approach to extract ordered timelines of events, their participants, locations and times from a set of multilingual and cross-lingual data sources.
Finally, the results show that our emphasis on clustering features is crucial to develop robust out-of-domain models.
Ranked #63 on Named Entity Recognition (NER) on CoNLL 2003 (English)
IXA pipeline is a modular set of Natural Language Processing tools (or pipes) which provide easy access to NLP technology.
In this paper we focus on the creation of general-purpose (as opposed to domain-specific) polarity lexicons in five languages (French, Italian, Dutch, English and Spanish) using WordNet propagation.
Subtitling and audiovisual translation have been recognized as areas that could greatly benefit from the introduction of Statistical Machine Translation (SMT) followed by post-editing, in order to increase the efficiency of the subtitle production process.