no code implementations • COLING (CogALex) 2020 • Reyhaneh Hashempour, Aline Villavicencio
In this work, we leverage the Idiom Principle (Sinclair et al., 1991) and contextualized word embeddings (CWEs), focusing on Context2Vec (Melamud et al., 2016) and BERT (Devlin et al., 2019), to distinguish between literal and idiomatic senses of such expressions in context.
no code implementations • NAACL (CMCL) 2021 • Peter Vickers, Rosa Wainwright, Harish Tayyar Madabushi, Aline Villavicencio
The CogNLP-Sheffield submissions to the CMCL 2021 Shared Task examine the value of a variety of cognitively and linguistically inspired features for predicting eye tracking patterns, as both standalone model inputs and as supplements to contextual word embeddings (XLNet).
no code implementations • WS 2019 • Reyhaneh Hashempour, Barbara Plank, Aline Villavicencio, Renato Cordeiro de Amorim
Logistic regression (LR) and feed-forward neural networks (FFNN) with back-propagation were used to build models in two different settings: Inter-Lingual (IL) and Cross-Lingual (CL).
1 code implementation • 4 Nov 2024 • Wei He, Tiago Kramer Vieira, Marcos Garcia, Carolina Scarton, Marco Idiart, Aline Villavicencio
Idiomatic expressions are an integral part of human languages, often used to express complex ideas in compressed or conventional ways (e.g. eager beaver as a keen and enthusiastic person).
no code implementations • 21 Oct 2024 • Maggie Mi, Aline Villavicencio, Nafise Sadat Moosavi
Human processing of idioms relies on understanding the contextual sentences in which idioms occur, as well as language-intrinsic features such as frequency and speaker-intrinsic factors like familiarity.
no code implementations • 30 Sep 2024 • Marina Ribeiro, Bárbara Malcorra, Natália B. Mota, Rodrigo Wilkens, Aline Villavicencio, Lilian C. Hubner, César Rennó-Costa
This paper presents an explainable LLM method, named SLIME (Statistical and Linguistic Insights for Model Explanation), capable of identifying lexical components representative of AD and indicating which components are most important for the LLM's decision.
no code implementations • 21 Jun 2024 • Wei He, Marco Idiart, Carolina Scarton, Aline Villavicencio
Accurately modeling idiomatic or non-compositional language has been a longstanding challenge in Natural Language Processing (NLP).
2 code implementations • 17 Jun 2024 • Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English.
no code implementations • 15 May 2024 • Dylan Phelps, Thomas Pickard, Maggie Mi, Edward Gow-Smith, Aline Villavicencio
In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks?
1 code implementation • 14 May 2024 • Agne Knietaite, Adam Allsebrook, Anton Minkov, Adam Tomaszewski, Norbert Slinko, Richard Johnson, Thomas Pickard, Dylan Phelps, Aline Villavicencio
Compositionality in language models presents a problem when processing idiomatic expressions, as their meaning often cannot be directly derived from their individual parts.
1 code implementation • 16 Feb 2024 • Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras
We also show that adapting LLMs that have been pre-trained on more balanced multilingual data results in downstream performance comparable to the original models.
no code implementations • 15 Jan 2024 • Edward Gow-Smith, Dylan Phelps, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio
As such, removing these symbols has been shown to have a beneficial effect on the processing of morphologically complex words for transformer encoders in the pretrain-finetune paradigm.
1 code implementation • 26 May 2023 • Kun Zhao, Bohao Yang, Chenghua Lin, Wenge Rong, Aline Villavicencio, Xiaohui Cui
The long-standing one-to-many issue of the open-domain dialogues poses significant challenges for automatic evaluation methods, i.e., there may be multiple suitable responses which differ in semantics for a given conversational context.
no code implementations • 23 May 2023 • Rodrigo Wilkens, Leonardo Zilio, Aline Villavicencio
These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions (MWEs), thus allowing for an assessment of whether the model has learned different linguistic phenomena.
1 code implementation • 31 Oct 2022 • Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Aline Villavicencio, Iryna Gurevych
We compare sequential fine-tuning with a model for multi-task learning in the context where we are interested in boosting performance on two tasks, one of which depends on the other.
no code implementations • LREC (MWE) 2022 • Dylan Phelps, Xuan-Rui Fan, Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio
In particular we study the impact of Pattern Exploit Training (PET), a few-shot method of classification, and BERTRAM, an efficient method of creating contextual embeddings, on the task of idiomaticity detection.
1 code implementation • SemEval (NAACL) 2022 • Harish Tayyar Madabushi, Edward Gow-Smith, Marcos Garcia, Carolina Scarton, Marco Idiart, Aline Villavicencio
This paper presents the shared task on Multilingual Idiomaticity Detection and Sentence Embedding, which consists of two subtasks: (a) a binary classification task aimed at identifying whether a sentence contains an idiomatic expression, and (b) a task based on semantic text similarity which requires the model to adequately represent potentially idiomatic expressions in context.
1 code implementation • 8 Apr 2022 • Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio
We find that our modified algorithms lead to improved performance on downstream NLP tasks that involve handling complex words, whilst having no detrimental effect on performance in general natural language understanding tasks.
1 code implementation • Findings (EMNLP) 2021 • Harish Tayyar Madabushi, Edward Gow-Smith, Carolina Scarton, Aline Villavicencio
Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms.
1 code implementation • ACL 2021 • Marcos Garcia, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, Aline Villavicencio
This paper presents the Noun Compound Type and Token Idiomaticity (NCTTI) dataset, with human annotations for 280 noun compounds in English and 180 in Portuguese at both type and token level.
no code implementations • SIGUL (LREC) 2022 • Marcely Zanon Boito, Bolaji Yusuf, Lucas Ondel, Aline Villavicencio, Laurent Besacier
Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length.
1 code implementation • EACL 2021 • Marcos Garcia, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, Aline Villavicencio
Contextualised word representation models have been successfully used for capturing different word usages and they may be an attractive alternative for representing idiomaticity in language.
no code implementations • WS 2020 • Reyhaneh Hashempour, Aline Villavicencio
Studies on detecting idiomatic expressions mostly focus on discovering potentially idiomatic expressions disregarding the context.
no code implementations • LREC 2020 • Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier
For answering this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) for creating 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
1 code implementation • 11 Oct 2019 • Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier
For language documentation initiatives, transcription is an expensive resource: one minute of audio is estimated to take, on average, an hour and a half of a linguist's work (Austin and Sallabank, 2013).
1 code implementation • 19 Aug 2019 • Alexandre Salle, Aline Villavicencio
In distributional semantics, the pointwise mutual information ($\mathit{PMI}$) weighting of the cooccurrence matrix performs far better than raw counts.
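As a rough illustration of the weighting mentioned above, the sketch below computes PMI, log(P(w, c) / (P(w) P(c))), and its positive variant (PPMI) from a small co-occurrence count matrix. The function names `pmi` and `ppmi` and the toy counts are assumptions made for this example; it is not the implementation from the paper.

```python
import math

def pmi(counts):
    """PMI for a word-by-context co-occurrence count matrix (list of rows)."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]        # word marginals
    col_sums = [sum(col) for col in zip(*counts)]  # context marginals
    table = []
    for i, row in enumerate(counts):
        out = []
        for j, c in enumerate(row):
            if c == 0:
                # log(0) is undefined; treat unseen pairs as -infinity
                out.append(float("-inf"))
            else:
                # log( P(w, c) / (P(w) * P(c)) ), with probabilities as count ratios
                out.append(math.log(c * total / (row_sums[i] * col_sums[j])))
        table.append(out)
    return table

def ppmi(counts):
    """Positive PMI: negative (and -infinity) entries are clipped to zero."""
    return [[max(v, 0.0) for v in row] for row in pmi(counts)]

# Toy counts (hypothetical): two words, two contexts
counts = [[2, 0], [0, 2]]
table = ppmi(counts)
```

Here each observed pair occurs twice as often as independence would predict, so its PMI is log 2, while the unseen pairs are clipped to zero by PPMI.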
no code implementations • WS 2019 • Aline Villavicencio
Multiword expressions (MWEs) feature prominently in the mental lexicon of native speakers (Jackendoff, 1997) in all languages and domains, from informal to technical contexts (Biber et al., 1999) with about four MWEs being produced per minute of discourse (Glucksberg, 1989).
1 code implementation • 29 Jun 2019 • Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier
This task consists in aligning word sequences in a source language with phoneme sequences in a target language, inferring from it word segmentation on the target side [5].
no code implementations • CL 2019 • Silvio Cordeiro, Aline Villavicencio, Marco Idiart, Carlos Ramisch
General crosslingual analyses reveal the impact of morphological variation and corpus size in the ability of the model to predict compositionality, and of a uniform combination of the components for best results.
no code implementations • 27 Jul 2018 • Marcely Zanon Boito, Antonios Anastasopoulos, Marika Lekakou, Aline Villavicencio, Laurent Besacier
This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research.
no code implementations • 18 Jun 2018 • Pierre Godard, Marcely Zanon-Boito, Lucas Ondel, Alexandre Berard, François Yvon, Aline Villavicencio, Laurent Besacier
We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal to automatically identify lexical units in a low-resource, unwritten language (UL).
no code implementations • NAACL 2018 • Felipe Paula, Rodrigo Wilkens, Marco Idiart, Aline Villavicencio
Semantic Verbal Fluency tests have been used in the detection of certain clinical conditions, such as dementia.
1 code implementation • WS 2018 • Alexandre Salle, Aline Villavicencio
The positive effect of adding subword information to word embeddings has been demonstrated for predictive models.
no code implementations • 17 Sep 2017 • Marcely Zanon Boito, Alexandre Berard, Aline Villavicencio, Laurent Besacier
Word discovery is the task of extracting words from unsegmented text.
no code implementations • ACL 2018 • Alexandre Salle, Aline Villavicencio
Increasing the capacity of recurrent neural networks (RNN) usually involves augmenting the size of the hidden layer, with significant increase of computational cost.
no code implementations • WS 2016 • Jorge Alberto Wagner Filho, Rodrigo Wilkens, Aline Villavicencio
In a comparison between shallow and deeper features, the former already produce F-measures of over 0.75 for Portuguese texts, but the use of additional features yields even better results in most cases.
1 code implementation • 3 Jun 2016 • Alexandre Salle, Marco Idiart, Aline Villavicencio
The effectiveness of both modifications is shown using word similarity and analogy tasks.
1 code implementation • ACL 2016 • Alexandre Salle, Marco Idiart, Aline Villavicencio
In this paper, we propose LexVec, a new method for generating distributed word representations that uses low-rank, weighted factorization of the Positive Point-wise Mutual Information matrix via stochastic gradient descent, employing a weighting scheme that assigns heavier penalties for errors on frequent co-occurrences while still accounting for negative co-occurrence.
no code implementations • LREC 2016 • Silvio Cordeiro, Carlos Ramisch, Aline Villavicencio
This paper presents mwetoolkit+sem: an extension of the mwetoolkit that estimates semantic compositionality scores for multiword expressions (MWEs) based on word embeddings.
no code implementations • LREC 2016 • Leonardo Zilio, Maria José Bocorny Finatto, Aline Villavicencio
The sentences from both corpora were annotated separately, so that it is possible to access sentences either from the Cardiology or from the newspaper corpus.
no code implementations • LREC 2016 • Rodrigo Wilkens, Marco Idiart, Aline Villavicencio
Focusing on compound nouns (CN), we then verify in a longitudinal study if there are differences in the distribution and compositionality of CNs in child-directed and child-produced sentences across ages.
no code implementations • LREC 2016 • Rodrigo Wilkens, Leonardo Zilio, Eduardo Ferreira, Aline Villavicencio
They can be used as the basis for evaluating the accuracy of the similarity relations on distributional thesauri by comparing the proximity of the target word with the related and unrelated options and observing if the related word has the highest similarity value among them.
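The evaluation described above can be sketched as follows: an item counts as correct when the related option is closer to the target than every unrelated option. This is a minimal sketch that assumes cosine similarity over word vectors as the similarity measure; the helper names, the item format and the toy vectors are hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def item_correct(vectors, target, related, unrelated):
    """True if the related word has the highest similarity to the target."""
    related_sim = cosine(vectors[target], vectors[related])
    return all(
        related_sim > cosine(vectors[target], vectors[word])
        for word in unrelated
    )

def accuracy(vectors, items):
    """Fraction of (target, related, unrelated options) items answered correctly."""
    hits = sum(item_correct(vectors, t, r, u) for t, r, u in items)
    return hits / len(items)

# Toy vectors and a single test item (hypothetical values)
vectors = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "banana": [0.0, 1.0],
    "river": [0.1, 0.9],
}
items = [("car", "automobile", ["banana", "river"])]
score = accuracy(vectors, items)
```

With a real distributional thesaurus, the similarity function would be replaced by the thesaurus's own scores for each target and option pair.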
no code implementations • LREC 2014 • Rodrigo Boos, Kassius Prestes, Aline Villavicencio
To indirectly assess the quality of the resulting corpus we examined the impact of corpus origin in a specific task, the identification of Multiword Expressions with association measures, against a standard corpus.
no code implementations • LREC 2014 • Muntsa Padró, Marco Idiart, Aline Villavicencio, Carlos Ramisch
Distributional thesauri have been applied for a variety of tasks involving semantic relatedness.
no code implementations • LREC 2014 • Bruno Laranjeira, Viviane Moreira, Aline Villavicencio, Carlos Ramisch, Maria José Finatto
Comparable corpora have been used as an alternative for parallel corpora as resources for computational tasks that involve domain-specific natural language processing.
no code implementations • LREC 2012 • Aline Villavicencio, Beracah Yankama, Marco Idiart, Robert Berwick
This paper describes such an initiative for combining information from various sources to extend the annotation of the English CHILDES corpora with linguistic, psycholinguistic and distributional information, along with an example illustrating an application of this approach to the extraction of verb alternation information.