no code implementations • ACL 2022 • Clara Meister, Gian Wiher, Tiago Pimentel, Ryan Cotterell
When generating natural language from neural probabilistic models, high probability does not always coincide with high quality.
no code implementations • ACL (SIGMORPHON) 2021 • Tiago Pimentel, Maria Ryskina, Sabrina J. Mielke, Shijie Wu, Eleanor Chodroff, Brian Leonard, Garrett Nicolai, Yustinus Ghanggo Ate, Salam Khalifa, Nizar Habash, Charbel El-Khaissi, Omer Goldman, Michael Gasser, William Lane, Matt Coler, Arturo Oncevay, Jaime Rafael Montoya Samame, Gema Celeste Silva Villegas, Adam Ek, Jean-Philippe Bernardy, Andrey Shcherbakov, Aziyana Bayyr-ool, Karina Sheifer, Sofya Ganieva, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Andrew Krizhanovsky, Natalia Krizhanovsky, Clara Vania, Sardana Ivanova, Aelita Salchak, Christopher Straughn, Zoey Liu, Jonathan North Washington, Duygu Ataman, Witold Kieraś, Marcin Woliński, Totok Suhardijanto, Niklas Stoehr, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Richard J. Hatcher, Emily Prud'hommeaux, Ritesh Kumar, Mans Hulden, Botond Barta, Dorina Lakatos, Gábor Szolnok, Judit Ács, Mohit Raj, David Yarowsky, Ryan Cotterell, Ben Ambridge, Ekaterina Vylomova
This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features.
1 code implementation • EMNLP 2021 • Tiago Pimentel, Clara Meister, Elizabeth Salesky, Simone Teufel, Damián Blasi, Ryan Cotterell
We thus conclude that there is strong evidence of a surprisal–duration trade-off in operation, both across and within the world’s languages.
no code implementations • 19 Dec 2024 • Philip Whittington, Gregor Bachmann, Tiago Pimentel
In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
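To make the bottom-up variant concrete: a minimal, illustrative sketch (not the paper's code) of applying a fixed sequence of merge operations, BPE-style; compression then amounts to choosing merges so that the final symbol sequence is short.

    def apply_merges(text, merges):
        """Bottom-up tokenisation sketch: start from characters and apply a
        fixed sequence of merge operations, as in BPE. Illustrative only."""
        symbols = list(text)
        for a, b in merges:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)  # merge the adjacent pair into one symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
        return symbols

    print(apply_merges("banana", [("a", "n"), ("an", "an")]))  # ['b', 'anan', 'a']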
no code implementations • 23 Oct 2024 • Clara Meister, Mario Giulianelli, Tiago Pimentel
Surprisal theory posits that the cognitive effort required to comprehend a word is determined by its contextual predictability, quantified as surprisal.
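Concretely, a word's surprisal is its negative log-probability in context; a toy sketch (the distribution below is invented for illustration):

    import math

    # Invented next-word distribution p(w | "the cat sat on the").
    p_next = {"mat": 0.6, "floor": 0.25, "moon": 0.15}

    def surprisal_bits(word):
        """s(w) = -log2 p(w | context), in bits."""
        return -math.log2(p_next[word])

    for w in p_next:
        print(w, round(surprisal_bits(w), 2))  # predictable words get low surprisal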
1 code implementation • 14 Oct 2024 • Daniel Gareev, Thomas Hofmann, Ezhilmathi Krishnasamy, Tiago Pimentel
Traditional methods, such as top-$k$ and top-$\pi$, apply local normalisation to the model's output distribution, which can distort it.
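The local normalisation step is easy to see in a sketch of top-$k$ truncation (illustrative; this is the baseline being criticised, not the paper's proposed method):

    import numpy as np

    def top_k_renormalise(p, k):
        """Zero out all but the k most probable tokens, then renormalise.
        The renormalised head no longer matches the model's original
        probabilities; that mismatch is the distortion at issue."""
        q = np.zeros_like(p)
        top = np.argsort(p)[-k:]  # indices of the k most probable tokens
        q[top] = p[top]
        return q / q.sum()

    p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
    print(top_k_renormalise(p, k=2))  # [0.714 0.286 0.    0.    0.   ]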
1 code implementation • 27 Jul 2024 • Ionut Constantinescu, Tiago Pimentel, Ryan Cotterell, Alex Warstadt
We vary the age of exposure by training LMs on language pairs in various experimental conditions, and find that LMs, which lack any direct analog to innate maturational stages, do not show CP effects when the age of exposure of L2 is delayed.
1 code implementation • 20 Jun 2024 • Tiago Pimentel, Clara Meister
Language models (LMs) estimate a probability distribution over strings in a natural language; these distributions are crucial for computing perplexity and surprisal in linguistics research.
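For instance, perplexity is the exponentiated average negative log-likelihood per token; a minimal sketch with invented log-probabilities:

    import math

    def perplexity(token_logprobs):
        """exp of the mean negative (natural-log) likelihood per token."""
        return math.exp(-sum(token_logprobs) / len(token_logprobs))

    # Invented per-token log-probabilities for a four-token string.
    print(perplexity([-1.2, -0.4, -2.3, -0.9]))  # ~3.32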
1 code implementation • 6 Jun 2024 • Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel
Understanding memorisation in language models has practical and societal implications, e.g., studying models' training dynamics or preventing copyright infringements.
1 code implementation • 11 Apr 2024 • Anton Schäfer, Shauli Ravfogel, Thomas Hofmann, Tiago Pimentel, Imanol Schlag
In controlled experiments on perfectly equivalent cloned languages, we observe that the existence of a predominant language during training boosts the performance of less frequent languages and leads to stronger alignment of model representations across languages.
1 code implementation • 9 Apr 2024 • Anton Schäfer, Thomas Hofmann, Imanol Schlag, Tiago Pimentel
In this paper, we study the impact of near duplicate subwords on LM training efficiency.
1 code implementation • 6 Dec 2023 • Tiago Pimentel, Clara Meister, Ethan Gotlieb Wilcox, Kyle Mahowald, Ryan Cotterell
Under this method, we find that a language's word lengths should instead be proportional to the surprisal's expectation plus its variance-to-mean ratio.
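That predictor is straightforward to compute from surprisal samples; a sketch with invented numbers:

    import statistics

    # Invented surprisal samples (bits) for one word type across contexts.
    s = [3.1, 4.7, 2.8, 5.2, 3.9]
    mean = statistics.fmean(s)
    var = statistics.pvariance(s)
    # Expectation plus variance-to-mean ratio: the proposed length predictor.
    print(mean + var / mean)  # ~4.15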
1 code implementation • 28 Nov 2023 • Lukas Wolf, Tiago Pimentel, Evelina Fedorenko, Ryan Cotterell, Alex Warstadt, Ethan Wilcox, Tamar Regev
Using a large spoken corpus of English audiobooks, we extract prosodic features aligned to individual words and test how well they can be predicted from LLM embeddings, compared to non-contextual word embeddings.
no code implementations • 27 Nov 2023 • Andreas Opedal, Eleftheria Tsipidi, Tiago Pimentel, Ryan Cotterell, Tim Vieira
The left-corner transformation (Rosenkrantz and Lewis, 1970) is used to remove left recursion from context-free grammars, which is an important step towards making the grammar parsable top-down with simple techniques.
1 code implementation • 7 Jul 2023 • Clara Meister, Tiago Pimentel, Luca Malagutti, Ethan G. Wilcox, Ryan Cotterell
While this trade-off is not reflected in standard metrics of distribution quality (such as perplexity), we find that several precision-emphasizing measures indeed indicate that sampling adapters can lead to probability distributions more aligned with the true distribution.
no code implementations • 7 Jul 2023 • Ethan Gotlieb Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, Roger P. Levy
We address this gap in the current literature by investigating the relationship between surprisal and reading times in eleven different languages, distributed across five language families.
1 code implementation • 6 Jun 2023 • Thomas Hikaru Clark, Clara Meister, Tiago Pimentel, Michael Hahn, Ryan Cotterell, Richard Futrell, Roger Levy
Here, we ask whether a pressure for UID may have influenced word order patterns cross-linguistically.
1 code implementation • 26 May 2023 • Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, Yanai Elazar
In this paper, we compare the generalization of few-shot fine-tuning and in-context learning to challenge datasets, while controlling for the models used, the number of examples, and the number of parameters, ranging from 125M to 30B.
no code implementations • 20 Dec 2022 • Li Du, Lucas Torroba Hennigen, Tiago Pimentel, Clara Meister, Jason Eisner, Ryan Cotterell
Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings.
no code implementations • 19 Dec 2022 • Clara Meister, Wojciech Stokowiec, Tiago Pimentel, Lei Yu, Laura Rimell, Adhiguna Kuncoro
After just a few hundred training updates, a standard probabilistic model for language generation has likely not yet learnt many semantic or syntactic rules of natural language, making it difficult to estimate the probability distribution over next tokens.
1 code implementation • 25 Nov 2022 • Tiago Pimentel, Clara Meister, Ethan G. Wilcox, Roger Levy, Ryan Cotterell
We assess the effect of anticipation on reading by comparing how well surprisal and contextual entropy predict reading times on four naturalistic reading datasets: two self-paced and two eye-tracking.
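The two predictors can be stated side by side; a sketch with an invented next-word distribution (index 1 is the word that actually occurred):

    import numpy as np

    p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])  # invented p(w | context)
    surprisal = -np.log2(p[1])        # responsive: the observed word, ~2.32 bits
    entropy = -np.sum(p * np.log2(p)) # anticipatory: expected surprisal, ~1.92 bits
    print(surprisal, entropy)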
no code implementations • 11 Nov 2022 • Tiago Pimentel, Josef Valvoda, Niklas Stoehr, Ryan Cotterell
This shift in perspective leads us to propose a new principle for probing, the architectural bottleneck principle: In order to estimate how much information a given component could extract, a probe should look exactly like the component.
no code implementations • 6 Oct 2022 • Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer, Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, Zhijing Jin
We present a taxonomy for characterising and understanding generalisation research in NLP.
1 code implementation • 14 Sep 2022 • Clemente Pasti, Andreas Opedal, Tiago Pimentel, Tim Vieira, Jason Eisner, Ryan Cotterell
It shows, by a simple construction, that the intersection of a context-free language and a regular language is itself context-free.
no code implementations • 15 Jun 2022 • Xin Xin, Tiago Pimentel, Alexandros Karatzoglou, Pengjie Ren, Konstantina Christakopoulou, Zhaochun Ren
As reinforcement learning (RL) naturally fits this objective (maximizing a user's reward per session), it has become an emerging topic in recommender systems.
1 code implementation • 31 May 2022 • Tiago Pimentel, Clara Meister, Ryan Cotterell
As we show, however, this is not a tight approximation, in either theory or practice.
1 code implementation • 14 May 2022 • Afra Amini, Tiago Pimentel, Clara Meister, Ryan Cotterell
Probing has become a go-to methodology for interpreting and analyzing deep neural models in natural language processing.
no code implementations • LREC 2022 • Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova
The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.
no code implementations • ACL 2022 • Karim Lasri, Tiago Pimentel, Alessandro Lenci, Thierry Poibeau, Ryan Cotterell
We also find that BERT uses a separate encoding of grammatical number for nouns and verbs.
no code implementations • 31 Mar 2022 • Clara Meister, Gian Wiher, Tiago Pimentel, Ryan Cotterell
Specifically, we posit that human-like language should contain an amount of information (quantified as negative log-probability) that is close to the entropy of the distribution over natural strings.
no code implementations • ACL 2022 • Clara Meister, Tiago Pimentel, Thomas Hikaru Clark, Ryan Cotterell, Roger Levy
Numerous analyses of reading time (RT) data have been implemented, all in an effort to better understand the cognitive processes driving reading comprehension.
3 code implementations • 1 Feb 2022 • Clara Meister, Tiago Pimentel, Gian Wiher, Ryan Cotterell
Automatic and human evaluations show that, in comparison to nucleus and top-$k$ sampling, locally typical sampling offers competitive performance (in both abstractive summarization and story generation) in terms of quality while consistently reducing degenerate repetitions.
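A rough sketch of the idea behind locally typical sampling (simplified; details in the paper may differ):

    import numpy as np

    def locally_typical_filter(p, tau=0.95):
        """Keep the tokens whose surprisal is closest to the entropy of the
        next-token distribution until at least `tau` probability mass is
        accumulated, then renormalise."""
        surprisal = -np.log(p)
        entropy = float(np.sum(p * surprisal))
        order = np.argsort(np.abs(surprisal - entropy))  # most 'typical' first
        cutoff = int(np.searchsorted(np.cumsum(p[order]), tau)) + 1
        q = np.zeros_like(p)
        q[order[:cutoff]] = p[order[:cutoff]]
        return q / q.sum()

    p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
    print(locally_typical_filter(p, tau=0.6))  # renormalised over the most 'typical' tokens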
1 code implementation • EMNLP 2021 • Tiago Pimentel, Clara Meister, Simone Teufel, Ryan Cotterell
Homophony's widespread presence in natural languages is a controversial topic.
no code implementations • EMNLP 2021 • Clara Meister, Tiago Pimentel, Patrick Haller, Lena Jäger, Ryan Cotterell, Roger Levy
The uniform information density (UID) hypothesis posits a preference among language users for utterances structured such that information is distributed uniformly across a signal.
1 code implementation • EMNLP 2021 • Tiago Pimentel, Ryan Cotterell
Pimentel et al. (2020) recently analysed probing from an information-theoretic perspective.
1 code implementation • Findings (ACL) 2021 • Irene Nikkarinen, Tiago Pimentel, Damián E. Blasi, Ryan Cotterell
The unigram distribution is the non-contextual probability of finding a specific word form in a corpus.
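As a baseline illustration, the naive maximum-likelihood estimate of that quantity is just relative frequency:

    from collections import Counter

    corpus = "the cat sat on the mat and the dog sat too".split()
    counts = Counter(corpus)
    total = sum(counts.values())
    unigram = {w: c / total for w, c in counts.items()}
    print(unigram["the"])  # 3/11: the non-contextual probability of "the"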
no code implementations • NAACL 2021 • Jennifer C. White, Tiago Pimentel, Naomi Saphra, Ryan Cotterell
Probes are models devised to investigate the encoding of knowledge (e.g., syntactic structure) in contextual representations.
no code implementations • NAACL 2021 • Tiago Pimentel, Irene Nikkarinen, Kyle Mahowald, Ryan Cotterell, Damián Blasi
Examining corpora from 7 typologically diverse languages, we use those upper bounds to quantify the lexicon's optimality and to explore the relative costs of major constraints on natural codes.
1 code implementation • 15 Apr 2021 • Karolina Stańczak, Sagnik Ray Choudhury, Tiago Pimentel, Ryan Cotterell, Isabelle Augenstein
Recent research has demonstrated that large pre-trained language models reflect societal biases expressed in natural language.
2 code implementations • NAACL 2021 • Tiago Pimentel, Brian Roark, Søren Wichmann, Ryan Cotterell, Damián Blasi
It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words.
1 code implementation • EACL 2021 • Tiago Pimentel, Ryan Cotterell, Brian Roark
Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower).
1 code implementation • EMNLP 2020 • Tiago Pimentel, Naomi Saphra, Adina Williams, Ryan Cotterell
In our contribution to this discussion, we argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance: the Pareto hypervolume.
1 code implementation • EMNLP 2020 • Tiago Pimentel, Rowan Hall Maudslay, Damián Blasi, Ryan Cotterell
For a language to be clear and efficiently encoded, we posit that the lexical ambiguity of a word type should correlate with how much information context provides about it, on average.
no code implementations • WS 2020 • Rowan Hall Maudslay, Tiago Pimentel, Ryan Cotterell, Simone Teufel
We report the results of our system on the Metaphor Detection Shared Task at the Second Workshop on Figurative Language Processing 2020.
1 code implementation • WS 2020 • Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, Mans Hulden
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
no code implementations • ACL 2020 • Elizabeth Salesky, Eleanor Chodroff, Tiago Pimentel, Matthew Wiesner, Ryan Cotterell, Alan W. Black, Jason Eisner
A major hurdle in data-driven research on typology is having sufficient data in many languages to draw meaningful conclusions.
1 code implementation • TACL 2020 • Tiago Pimentel, Brian Roark, Ryan Cotterell
We present methods for calculating a measure of phonotactic complexity (bits per phoneme) that permits a straightforward cross-linguistic comparison.
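The measure reduces to an average per-phone surprisal under a phone-level model; a sketch with invented probabilities:

    import math

    def bits_per_phoneme(phone_probs):
        """Average negative log2-probability per phone under a phone-level LM."""
        return -sum(math.log2(p) for p in phone_probs) / len(phone_probs)

    # Invented per-phone probabilities for the phones of one word.
    print(bits_per_phoneme([0.25, 0.1, 0.3, 0.2]))  # ~2.35 bits per phoneme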
1 code implementation • ACL 2020 • Rowan Hall Maudslay, Josef Valvoda, Tiago Pimentel, Adina Williams, Ryan Cotterell
One such probe is the structural probe (Hewitt and Manning, 2019), designed to quantify the extent to which syntactic information is encoded in contextualised word representations.
1 code implementation • ACL 2020 • Adina Williams, Tiago Pimentel, Arya D. McCarthy, Hagen Blix, Eleanor Chodroff, Ryan Cotterell
We find for two Indo-European languages (Czech and German) that form and meaning respectively share significant amounts of information with class (and contribute additional information above and beyond gender).
no code implementations • 22 Apr 2020 • Dan Valle, Tiago Pimentel, Adriano Veloso
Thus, in this work we propose an objective measure to evaluate the reliability of explanations of deep models.
1 code implementation • ACL 2020 • Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, Ryan Cotterell
The success of neural networks on a diverse set of NLP tasks has led researchers to question how much these networks actually "know" about natural language.
no code implementations • WS 2019 • Tiago Pimentel, Brian Roark, Ryan Cotterell
In this work, we propose the use of phone-level language models to estimate phonotactic complexity (measured in bits per phoneme), which makes cross-linguistic comparison straightforward.
1 code implementation • ACL 2019 • Tiago Pimentel, Arya D. McCarthy, Damián E. Blasi, Brian Roark, Ryan Cotterell
A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade?
no code implementations • ICLR 2019 • Tiago Pimentel, Marianne Monteiro, Juliano Viana, Adriano Veloso, Nivio Ziviani
This work presents a method for active anomaly detection which can be built upon existing deep learning solutions for unsupervised anomaly detection.
no code implementations • 23 May 2018 • Tiago Pimentel, Marianne Monteiro, Adriano Veloso, Nivio Ziviani
Anomalies are intuitively easy for human experts to understand, but they are hard to define mathematically.
no code implementations • ICLR 2018 • Tiago Pimentel, Adriano Veloso, Nivio Ziviani
Representation learning is one of the foundations of Deep Learning and has enabled important improvements on several Machine Learning tasks, such as Neural Machine Translation, Question Answering and Speech Recognition.
no code implementations • 9 Jul 2017 • Divya Shah, Ernesto Denicia, Tiago Pimentel, Barbara Bruno, Fulvio Mastrogiovanni
Bimanual gestures are of the utmost importance for the study of motor coordination in humans and in everyday activities.