Search Results for author: Arya D. McCarthy

Found 39 papers, 9 papers with code

Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages

no code implementations • COLING 2022 • Georgie Botev, Arya D. McCarthy, Winston Wu, David Yarowsky

This paper presents a detailed foundational empirical case study of the nature of out-of-vocabulary words encountered in modern text in a moderate-resource language such as Bulgarian, and a multi-faceted distributional analysis of the underlying word-formation processes that can aid in their compositional translation, tagging, parsing, language modeling, and other NLP tasks.

Language Modelling Machine Translation +1

Findings of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering

no code implementations • ACL (SIGMORPHON) 2021 • Adam Wiemerslage, Arya D. McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden, Katharina Kann

We describe the second SIGMORPHON shared task on unsupervised morphology: the goal of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering is to cluster word types from a raw text corpus into paradigms.

Clustering

Measuring the Similarity of Grammatical Gender Systems by Comparing Partitions

no code implementations • EMNLP 2020 • Arya D. McCarthy, Adina Williams, Shijia Liu, David Yarowsky, Ryan Cotterell

Of particular interest, languages on the same branch of our phylogenetic tree are notably similar, whereas languages from separate branches are no more similar than chance.

Community Detection

Jump-Starting Item Parameters for Adaptive Language Tests

no code implementations • EMNLP 2021 • Arya D. McCarthy, Kevin P. Yancey, Geoff T. LaFlair, Jesse Egbert, Manqian Liao, Burr Settles

A challenge in designing high-stakes language assessments is calibrating the test item difficulties, either a priori or from limited pilot test data.

Language Acquisition Multi-Task Learning +1

A Mixed-Methods Analysis of Western and Hong Kong–based Reporting on the 2019–2020 Protests

no code implementations • EMNLP (LaTeCH-CLfL) 2021 • Arya D. McCarthy, James Scharf, Giovanna Maria Dora Dore

We apply statistical techniques from natural language processing to Western and Hong Kong–based English language newspaper articles that discuss the 2019–2020 Hong Kong protests of the Anti-Extradition Law Amendment Bill Movement.

Sentiment Analysis

Characterizing News Portrayal of Civil Unrest in Hong Kong, 1998–2020

no code implementations • ACL (CASE) 2021 • James Scharf, Arya D. McCarthy, Giovanna Maria Dora Dore

We apply statistical techniques from natural language processing to a collection of Western and Hong Kong–based English-language newspaper articles spanning the years 1998–2020, studying the difference and evolution of its portrayal.

Hong Kong: Longitudinal and Synchronic Characterisations of Protest News between 1998 and 2020

no code implementations • LREC 2022 • Arya D. McCarthy, Giovanna Maria Dora Dore

This paper showcases the utility and timeliness of the Hong Kong Protest News Dataset, a highly curated collection of news articles from diverse news sources, to investigate longitudinal and synchronic news characterisations of protests in Hong Kong between 1998 and 2020.

FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning

1 code implementation • 2 Apr 2024 • Joel Niklaus, Lucia Zheng, Arya D. McCarthy, Christopher Hahn, Brian M. Rosen, Peter Henderson, Daniel E. Ho, Garrett Honke, Percy Liang, Christopher Manning

In this work, we curate LawInstruct, a large legal instruction dataset, covering 17 jurisdictions, 24 languages and a total of 12M examples.

Decision Making Legal Reasoning

Long-Form Speech Translation through Segmentation with Finite-State Decoding Constraints on Large Language Models

no code implementations • 20 Oct 2023 • Arya D. McCarthy, Hao Zhang, Shankar Kumar, Felix Stahlberg, Ke Wu

One challenge in speech translation is that plenty of spoken content is long-form, but short units are necessary for obtaining high-quality translations.

Hallucination Translation

Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

1 code implementation • 15 Feb 2023 • Abteen Ebrahimi, Arya D. McCarthy, Arturo Oncevay, Luis Chiruzzo, John E. Ortega, Gustavo A. Giménez-Lugo, Rolando Coto-Solano, Katharina Kann

However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data.

named-entity-recognition Named Entity Recognition +3

Improved Long-Form Spoken Language Translation with Large Language Models

no code implementations • 19 Dec 2022 • Arya D. McCarthy, Hao Zhang, Shankar Kumar, Felix Stahlberg, Axel H. Ng

A challenge in spoken language translation is that plenty of spoken content is long-form, but short units are necessary for obtaining high-quality translations.

Language Modelling Large Language Model +1

A Major Obstacle for NLP Research: Let's Talk about Time Allocation!

no code implementations • 30 Nov 2022 • Katharina Kann, Shiran Dudy, Arya D. McCarthy

The field of natural language processing (NLP) has grown over the last few years: conferences have become larger, we have published an incredible amount of papers, and state-of-the-art research has been implemented in a large variety of customer-facing products.

UniMorph 4.0: Universal Morphology

no code implementations • LREC 2022 • Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova

The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.

Morphological Inflection

Morphological Processing of Low-Resource Languages: Where We Are and What's Next

no code implementations • 16 Mar 2022 • Adam Wiemerslage, Miikka Silfverberg, Changbing Yang, Arya D. McCarthy, Garrett Nicolai, Eliana Colunga, Katharina Kann

Automatic morphological processing can aid downstream natural language processing applications, especially for low-resource languages, and assist language documentation efforts for endangered languages.

Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?

no code implementations • Findings (ACL) 2022 • En-Shiun Annie Lee, Sarubi Thillainathan, Shravan Nayak, Surangika Ranathunga, David Ifeoluwa Adelani, Ruisi Su, Arya D. McCarthy

What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages?

Machine Translation Translation

On the Uncomputability of Partition Functions in Energy-Based Sequence Models

no code implementations • ICLR 2022 • Chu-Cheng Lin, Arya D. McCarthy

In this paper, we argue that energy-based sequence models backed by expressive parametric families can result in uncomputable and inapproximable partition functions.

Model Selection

AirWare: Utilizing Embedded Audio and Infrared Signals for In-Air Hand-Gesture Recognition

no code implementations • 25 Jan 2021 • Nibhrat Lohia, Raunak Mundada, Arya D. McCarthy, Eric C. Larson

We introduce AirWare, an in-air hand-gesture recognition system that uses the already embedded speaker and microphone in most electronic devices, together with embedded infrared proximity sensors.

Hand Gesture Recognition Hand-Gesture Recognition Human-Computer Interaction

Neural Transduction for Multilingual Lexical Translation

no code implementations • COLING 2020 • Dylan Lewis, Winston Wu, Arya D. McCarthy, David Yarowsky

We present a method for completing multilingual translation dictionaries.

Translation

The human unlikeness of neural language models in next-word prediction

no code implementations • WS 2020 • Cassandra L. Jacobs, Arya D. McCarthy

The training objective of unidirectional language models (LMs) is similar to a psycholinguistic benchmark known as the cloze task, which measures next-word predictability.

The JHU Submission to the 2020 Duolingo Shared Task on Simultaneous Translation and Paraphrase for Language Education

no code implementations • WS 2020 • Huda Khayrallah, Jacob Bremerman, Arya D. McCarthy, Kenton Murray, Winston Wu, Matt Post

This paper presents the Johns Hopkins University submission to the 2020 Duolingo Shared Task on Simultaneous Translation and Paraphrase for Language Education (STAPLE).

Machine Translation Translation

Addressing Posterior Collapse with Mutual Information for Improved Variational Neural Machine Translation

no code implementations • ACL 2020 • Arya D. McCarthy, Xi-An Li, Jiatao Gu, Ning Dong

This paper proposes a simple and effective approach to address the problem of posterior collapse in conditional variational autoencoders (CVAEs).

Machine Translation NMT +1

Unsupervised Morphological Paradigm Completion

1 code implementation • ACL 2020 • Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya D. McCarthy, Katharina Kann

We propose the task of unsupervised morphological paradigm completion.

LEMMA Retrieval

Predicting Declension Class from Form and Meaning

1 code implementation • ACL 2020 • Adina Williams, Tiago Pimentel, Arya D. McCarthy, Hagen Blix, Eleanor Chodroff, Ryan Cotterell

We find for two Indo-European languages (Czech and German) that form and meaning respectively share significant amounts of information with class (and contribute additional information above and beyond gender).

Massively Multilingual Pronunciation Modeling with WikiPron

no code implementations • LREC 2020 • Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, Kyle Gorman

We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary.

The Johns Hopkins University Bible Corpus: 1600+ Tongues for Typological Exploration

no code implementations • LREC 2020 • Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, David Yarowsky

The corpus consists of over 4000 unique translations of the Christian Bible and counting.

Fine-grained Morphosyntactic Analysis and Generation Tools for More Than One Thousand Languages

no code implementations • LREC 2020 • Garrett Nicolai, Dylan Lewis, Arya D. McCarthy, Aaron Mueller, Winston Wu, David Yarowsky

Exploiting the broad translation of the Bible into the world's languages, we train and distribute morphosyntactic tools for approximately one thousand languages, vastly outstripping previous distributions of tools devoted to the processing of inflectional morphology.

Translation

An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages

no code implementations • LREC 2020 • Aaron Mueller, Garrett Nicolai, Arya D. McCarthy, Dylan Lewis, Winston Wu, David Yarowsky

We find that best practices in this domain are highly language-specific: adding more languages to a training set is often better, but too many harms performance; the best number depends on the source language.

Low-Resource Neural Machine Translation Translation

UniMorph 3.0: Universal Morphology

no code implementations • LREC 2020 • Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, David Yarowsky

SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation

1 code implementation • 27 Feb 2020 • Arya D. McCarthy, Liezl Puzon, Juan Pino

Our method compares favorably to SpecAugment on English→French and English→Romanian automatic speech translation (AST) tasks as well as on a low-resource English automatic speech recognition (ASR) task.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Weird Inflects but OK: Making Sense of Morphological Generation Errors

no code implementations • CONLL 2019 • Kyle Gorman, Arya D. McCarthy, Ryan Cotterell, Ekaterina Vylomova, Miikka Silfverberg, Magdalena Markowska

We conduct a manual error analysis of the CoNLL-SIGMORPHON Shared Task on Morphological Reinflection.

Text Generation

The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

no code implementations • WS 2019 • Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, Mans Hulden

The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages.

Cross-Lingual Transfer Lemmatization +3

Modeling Color Terminology Across Thousands of Languages

1 code implementation • IJCNLP 2019 • Arya D. McCarthy, Winston Wu, Aaron Mueller, Bill Watson, David Yarowsky

There is an extensive history of scholarship into what constitutes a "basic" color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969).

Improved Variational Neural Machine Translation by Promoting Mutual Information

no code implementations • 19 Sep 2019 • Arya D. McCarthy, Xi-An Li, Jiatao Gu, Ning Dong

Posterior collapse plagues VAEs for text, especially for conditional text generation with strong autoregressive decoders.

Conditional Text Generation Decoder +2

Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade

no code implementations • EMNLP (IWSLT) 2019 • Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, Deepak Gopinath

In this work, we evaluate several data augmentation and pretraining approaches for AST, by comparing all on the same datasets.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Meaning to Form: Measuring Systematicity as Information

1 code implementation • ACL 2019 • Tiago Pimentel, Arya D. McCarthy, Damián E. Blasi, Brian Roark, Ryan Cotterell

A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade?

UniMorph 2.0: Universal Morphology

3 code implementations • LREC 2018 • Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, Mans Hulden

The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology across the world's languages.

LEMMA

The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

no code implementations • CONLL 2018 • Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, Mans Hulden

Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a cloze task.

LEMMA Task 2

Marrying Universal Dependencies and Universal Morphology

no code implementations • WS 2018 • Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, David Yarowsky

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language.

Freezing Subnetworks to Analyze Domain Adaptation in Neural Machine Translation

1 code implementation • WS 2018 • Brian Thompson, Huda Khayrallah, Antonios Anastasopoulos, Arya D. McCarthy, Kevin Duh, Rebecca Marvin, Paul McNamee, Jeremy Gwinnup, Tim Anderson, Philipp Koehn

To better understand the effectiveness of continued training, we analyze the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and consider each component's contribution to, and capacity for, domain adaptation.

Decoder Domain Adaptation +2
