Search Results for author: Aline Villavicencio

Found 53 papers, 14 papers with code

CogNLP-Sheffield at CMCL 2021 Shared Task: Blending Cognitively Inspired Features with Transformer-based Language Models for Predicting Eye Tracking Patterns

no code implementations NAACL (CMCL) 2021 Peter Vickers, Rosa Wainwright, Harish Tayyar Madabushi, Aline Villavicencio

The CogNLP-Sheffield submissions to the CMCL 2021 Shared Task examine the value of a variety of cognitively and linguistically inspired features for predicting eye tracking patterns, as both standalone model inputs and as supplements to contextual word embeddings (XLNet).

Word Embeddings

Leveraging Contextual Embeddings and Idiom Principle for Detecting Idiomaticity in Potentially Idiomatic Expressions

no code implementations COLING (CogALex) 2020 Reyhaneh Hashempour, Aline Villavicencio

In this work, we leverage the Idiom Principle (Sinclair et al., 1991) and contextualized word embeddings (CWEs), focusing on Context2Vec (Melamud et al., 2016) and BERT (Devlin et al., 2019) to distinguish between literal and idiomatic senses of such expressions in context.

Text Simplification Word Embeddings

An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Generative LLM Inference

1 code implementation 16 Feb 2024 Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras

We also show that adapting LLMs that have been pre-trained on more balanced multilingual data results in downstream performance comparable to the original models.

Natural Language Understanding

Word Boundary Information Isn't Useful for Encoder Language Models

no code implementations 15 Jan 2024 Edward Gow-Smith, Dylan Phelps, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

As such, removing these symbols has been shown to have a beneficial effect on the processing of morphologically complex words for transformer encoders in the pretrain-finetune paradigm.

NER Sentence

Evaluating Open-Domain Dialogues in Latent Space with Next Sentence Prediction and Mutual Information

1 code implementation 26 May 2023 Kun Zhao, Bohao Yang, Chenghua Lin, Wenge Rong, Aline Villavicencio, Xiaohui Cui

The long-standing one-to-many issue of the open-domain dialogues poses significant challenges for automatic evaluation methods, i.e., there may be multiple suitable responses which differ in semantics for a given conversational context.

Semantic Similarity Semantic Textual Similarity +1

Assessing Linguistic Generalisation in Language Models: A Dataset for Brazilian Portuguese

no code implementations 23 May 2023 Rodrigo Wilkens, Leonardo Zilio, Aline Villavicencio

These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions (MWEs), thus allowing for an assessment of whether the model has learned different linguistic phenomena.

Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5

1 code implementation 31 Oct 2022 Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Aline Villavicencio, Iryna Gurevych

We compare sequential fine-tuning with a model for multi-task learning in the context where we are interested in boosting performance on two tasks, one of which depends on the other.

Multi-Task Learning Natural Language Inference

Sample Efficient Approaches for Idiomaticity Detection

no code implementations LREC (MWE) 2022 Dylan Phelps, Xuan-Rui Fan, Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

In particular we study the impact of Pattern Exploit Training (PET), a few-shot method of classification, and BERTRAM, an efficient method of creating contextual embeddings, on the task of idiomaticity detection.

SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding

1 code implementation SemEval (NAACL) 2022 Harish Tayyar Madabushi, Edward Gow-Smith, Marcos Garcia, Carolina Scarton, Marco Idiart, Aline Villavicencio

This paper presents the shared task on Multilingual Idiomaticity Detection and Sentence Embedding, which consists of two subtasks: (a) a binary classification task aimed at identifying whether a sentence contains an idiomatic expression, and (b) a task based on semantic text similarity which requires the model to adequately represent potentially idiomatic expressions in context.

Binary Classification Sentence +4

Improving Tokenisation by Alternative Treatment of Spaces

1 code implementation 8 Apr 2022 Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

We find that our modified algorithms lead to improved performance on downstream NLP tasks that involve handling complex words, whilst having no detrimental effect on performance in general natural language understanding tasks.

Natural Language Understanding

AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models

1 code implementation Findings (EMNLP) 2021 Harish Tayyar Madabushi, Edward Gow-Smith, Carolina Scarton, Aline Villavicencio

Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms.

Language Modelling

Assessing the Representations of Idiomaticity in Vector Models with a Noun Compound Dataset Labeled at Type and Token Levels

1 code implementation ACL 2021 Marcos Garcia, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, Aline Villavicencio

This paper presents the Noun Compound Type and Token Idiomaticity (NCTTI) dataset, with human annotations for 280 noun compounds in English and 180 in Portuguese at both type and token level.

Vocal Bursts Type Prediction

Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

no code implementations SIGUL (LREC) 2022 Marcely Zanon Boito, Bolaji Yusuf, Lucas Ondel, Aline Villavicencio, Laurent Besacier

Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length.

Probing for idiomaticity in vector space models

1 code implementation EACL 2021 Marcos Garcia, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, Aline Villavicencio

Contextualised word representation models have been successfully used for capturing different word usages and they may be an attractive alternative for representing idiomaticity in language.

Token Level Identification of Multiword Expressions Using Contextual Information

no code implementations WS 2020 Reyhaneh Hashempour, Aline Villavicencio

Studies on detecting idiomatic expressions mostly focus on discovering potentially idiomatic expressions disregarding the context.

Word Embeddings

Investigating Language Impact in Bilingual Approaches for Computational Language Documentation

no code implementations LREC 2020 Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier

For answering this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) for creating 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.

Segmentation Translation

How Does Language Influence Documentation Workflow? Unsupervised Word Discovery Using Translations in Multiple Languages

1 code implementation 11 Oct 2019 Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier

For language documentation initiatives, transcription is an expensive resource: one minute of audio is estimated to take one hour and a half on average of a linguist's work (Austin and Sallabank, 2013).

Why So Down? The Role of Negative (and Positive) Pointwise Mutual Information in Distributional Semantics

1 code implementation 19 Aug 2019 Alexandre Salle, Aline Villavicencio

In distributional semantics, the pointwise mutual information ($\mathit{PMI}$) weighting of the cooccurrence matrix performs far better than raw counts.
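
As a rough illustration of the weighting this abstract refers to, the sketch below computes PMI and its positive-clipped variant (PPMI) from a list of (word, context) co-occurrence pairs. The function names and the toy data are assumptions made for illustration; this is not the authors' code and it omits the smoothing and handling of unobserved pairs that a real system would need.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Map each observed (word, context) pair to log(P(w, c) / (P(w) * P(c)))."""
    pair_counts = Counter(pairs)
    total = sum(pair_counts.values())
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n
    # P(w, c) / (P(w) P(c)) simplifies to n * total / (count(w) * count(c)).
    return {
        (word, context): math.log(
            (n * total) / (word_counts[word] * context_counts[context])
        )
        for (word, context), n in pair_counts.items()
    }


def ppmi_table(pairs):
    """Positive PMI: negative PMI values are clipped to zero."""
    return {key: max(value, 0.0) for key, value in pmi_table(pairs).items()}


# Hypothetical toy co-occurrences: "a" mostly appears with "x", "b" with "y".
toy_pairs = (
    [("a", "x")] * 3 + [("a", "y")] + [("b", "x")] + [("b", "y")] * 3
)
pmi = pmi_table(toy_pairs)
ppmi = ppmi_table(toy_pairs)
```

In this toy data the pair ("a", "y") occurs less often than chance would predict, so its PMI is negative and PPMI discards that signal by setting it to zero.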

When the whole is greater than the sum of its parts: Multiword expressions and idiomaticity

no code implementations WS 2019 Aline Villavicencio

Multiword expressions (MWEs) feature prominently in the mental lexicon of native speakers (Jackendoff, 1997) in all languages and domains, from informal to technical contexts (Biber et al., 1999) with about four MWEs being produced per minute of discourse (Glucksberg, 1989).

Sentence

Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings

1 code implementation 29 Jun 2019 Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier

This task consists in aligning word sequences in a source language with phoneme sequences in a target language, inferring from it word segmentation on the target side [5].

Machine Translation

Unsupervised Compositionality Prediction of Nominal Compounds

no code implementations CL 2019 Silvio Cordeiro, Aline Villavicencio, Marco Idiart, Carlos Ramisch

General crosslingual analyses reveal the impact of morphological variation and corpus size in the ability of the model to predict compositionality, and of a uniform combination of the components for best results.

A small Griko-Italian speech translation corpus

no code implementations 27 Jul 2018 Marcely Zanon Boito, Antonios Anastasopoulos, Marika Lekakou, Aline Villavicencio, Laurent Besacier

This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research.

Translation

Unsupervised Word Segmentation from Speech with Attention

no code implementations 18 Jun 2018 Pierre Godard, Marcely Zanon-Boito, Lucas Ondel, Alexandre Berard, François Yvon, Aline Villavicencio, Laurent Besacier

We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal to automatically identify lexical units in a low-resource, unwritten language (UL).

Acoustic Unit Discovery Machine Translation +2

Incorporating Subword Information into Matrix Factorization Word Embeddings

1 code implementation WS 2018 Alexandre Salle, Aline Villavicencio

The positive effect of adding subword information to word embeddings has been demonstrated for predictive models.

Word Embeddings

Restricted Recurrent Neural Tensor Networks: Exploiting Word Frequency and Compositionality

no code implementations ACL 2018 Alexandre Salle, Aline Villavicencio

Increasing the capacity of recurrent neural networks (RNN) usually involves augmenting the size of the hidden layer, with significant increase of computational cost.

Language Modelling Tensor Networks

Automatic Construction of Large Readability Corpora

no code implementations WS 2016 Jorge Alberto Wagner Filho, Rodrigo Wilkens, Aline Villavicencio

In a comparison between shallow and deeper features, the former already produce F-measures of over 0.75 for Portuguese texts, but the use of additional features results in even better results, in most cases.

Text Classification Text Simplification

Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations

1 code implementation ACL 2016 Alexandre Salle, Marco Idiart, Aline Villavicencio

In this paper, we propose LexVec, a new method for generating distributed word representations that uses low-rank, weighted factorization of the Positive Point-wise Mutual Information matrix via stochastic gradient descent, employing a weighting scheme that assigns heavier penalties for errors on frequent co-occurrences while still accounting for negative co-occurrence.

Word Similarity

B2SG: a TOEFL-like Task for Portuguese

no code implementations LREC 2016 Rodrigo Wilkens, Leonardo Zilio, Eduardo Ferreira, Aline Villavicencio

They can be used as the basis for evaluating the accuracy of the similarity relations on distributional thesauri by comparing the proximity of the target word with the related and unrelated options and observing if the related word has the highest similarity value among them.
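
The evaluation procedure described in this abstract can be sketched roughly as follows: an item counts as correct when the related option is more similar to the target than every unrelated option. The item layout, the use of cosine similarity, the function names, and the toy vectors are all assumptions for illustration and are not taken from the B2SG resource itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toefl_style_accuracy(items, vectors):
    """Fraction of items where the related word is the most similar option.

    items: list of (target, related, [unrelated, ...]) tuples (assumed layout).
    vectors: dict mapping each word to its vector.
    """
    correct = 0
    for target, related, unrelated in items:
        related_score = cosine(vectors[target], vectors[related])
        if all(
            related_score > cosine(vectors[target], vectors[option])
            for option in unrelated
        ):
            correct += 1
    return correct / len(items)


# Hypothetical two-dimensional vectors, for demonstration only.
toy_vectors = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "banana": [0.0, 1.0],
    "cloud": [0.1, 1.0],
}
toy_items = [
    ("car", "automobile", ["banana", "cloud"]),
    ("banana", "car", ["cloud"]),
]
accuracy = toefl_style_accuracy(toy_items, toy_vectors)
```

The second toy item is deliberately one the vectors get wrong, since "cloud" lies closer to "banana" than "car" does.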

Multiword Expressions in Child Language

no code implementations LREC 2016 Rodrigo Wilkens, Marco Idiart, Aline Villavicencio

Focusing on compound nouns (CN), we then verify in a longitudinal study if there are differences in the distribution and compositionality of CNs in child-directed and child-produced sentences across ages.

Language Acquisition

mwetoolkit+sem: Integrating Word Embeddings in the mwetoolkit for Semantic MWE Processing

no code implementations LREC 2016 Silvio Cordeiro, Carlos Ramisch, Aline Villavicencio

This paper presents mwetoolkit+sem: an extension of the mwetoolkit that estimates semantic compositionality scores for multiword expressions (MWEs) based on word embeddings.

Word Embeddings

VerbLexPor: a lexical resource with semantic roles for Portuguese

no code implementations LREC 2016 Leonardo Zilio, Maria José Bocorny Finatto, Aline Villavicencio

The sentences from both corpora were annotated separately, so that it is possible to access sentences either from the Cardiology or from the newspaper corpus.

Sentence

Identification of Multiword Expressions in the brWaC

no code implementations LREC 2014 Rodrigo Boos, Kassius Prestes, Aline Villavicencio

To indirectly assess the quality of the resulting corpus we examined the impact of corpus origin in a specific task, the identification of Multiword Expressions with association measures, against a standard corpus.

Information Retrieval Machine Translation +1

Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them

no code implementations LREC 2014 Bruno Laranjeira, Viviane Moreira, Aline Villavicencio, Carlos Ramisch, Maria José Finatto

Comparable corpora have been used as an alternative for parallel corpora as resources for computational tasks that involve domain-specific natural language processing.

Machine Translation Translation

A large scale annotated child language construction database

no code implementations LREC 2012 Aline Villavicencio, Beracah Yankama, Marco Idiart, Robert Berwick

This paper describes such an initiative for combining information from various sources to extend the annotation of the English CHILDES corpora with linguistic, psycholinguistic and distributional information, along with an example illustrating an application of this approach to the extraction of verb alternation information.

Language Acquisition POS +1
