Search Results for author: David Yarowsky

Found 45 papers, 9 papers with code

Automatic Construction of Morphologically Motivated Translation Models for Highly Inflected, Low-Resource Languages

1 code implementation AMTA 2016 John Hewitt, Matt Post, David Yarowsky

Statistical Machine Translation (SMT) of highly inflected, low-resource languages suffers from the problem of low bitext availability, which is exacerbated by large inflectional paradigms.

Machine Translation Translation

Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages

no code implementations COLING 2022 Georgie Botev, Arya D. McCarthy, Winston Wu, David Yarowsky

This paper presents a detailed foundational empirical case study of the nature of out-of-vocabulary words encountered in modern text in a moderate-resource language such as Bulgarian, and a multi-faceted distributional analysis of the underlying word-formation processes that can aid in their compositional translation, tagging, parsing, language modeling, and other NLP tasks.

Language Modelling Machine Translation

On the Robustness of Cognate Generation Models

no code implementations LREC 2022 Winston Wu, David Yarowsky

We evaluate two popular neural cognate generation models’ robustness to several types of human-plausible noise (deletion, duplication, swapping, and keyboard errors, as well as a new type of error, phonological errors).

Vocal Bursts Type Prediction

On Pronunciations in Wiktionary: Extraction and Experiments on Multilingual Syllabification and Stress Prediction

no code implementations RANLP (BUCC) 2021 Winston Wu, David Yarowsky

We constructed parsers for five non-English editions of Wiktionary, which combined with pronunciations from the English edition, comprises over 5.3 million IPA pronunciations, the largest pronunciation lexicon of its kind.

Measuring the Similarity of Grammatical Gender Systems by Comparing Partitions

no code implementations EMNLP 2020 Arya D. McCarthy, Adina Williams, Shijia Liu, David Yarowsky, Ryan Cotterell

Of particular interest, languages on the same branch of our phylogenetic tree are notably similar, whereas languages from separate branches are no more similar than chance.

Community Detection

Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That!

no code implementations 16 Mar 2024 Niyati Bafna, Philipp Koehn, David Yarowsky

While Transformer-based neural machine translation (NMT) is very effective in high-resource settings, many languages lack the necessary large parallel corpora to benefit from it.

Machine Translation NMT

UniMorph 4.0: Universal Morphology

no code implementations LREC 2022 Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova

The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.

Morphological Inflection

Wiktionary Normalization of Translations and Morphological Information

no code implementations COLING 2020 Winston Wu, David Yarowsky

We extend the Yawipa Wiktionary Parser (Wu and Yarowsky, 2020) to extract and normalize translations from etymology glosses, and morphological form-of relations, resulting in 300K unique translations and over 4 million instances of 168 annotated morphological relations.


Computational Etymology and Word Emergence

no code implementations LREC 2020 Winston Wu, David Yarowsky

We developed an extensible, comprehensive Wiktionary parser that improves over several existing parsers.

Fine-grained Morphosyntactic Analysis and Generation Tools for More Than One Thousand Languages

no code implementations LREC 2020 Garrett Nicolai, Dylan Lewis, Arya D. McCarthy, Aaron Mueller, Winston Wu, David Yarowsky

Exploiting the broad translation of the Bible into the world's languages, we train and distribute morphosyntactic tools for approximately one thousand languages, vastly outstripping previous distributions of tools devoted to the processing of inflectional morphology.


Multilingual Dictionary Based Construction of Core Vocabulary

no code implementations LREC 2020 Winston Wu, Garrett Nicolai, David Yarowsky

We propose a new functional definition and construction method for core vocabulary sets for multiple applications based on the relative coverage of a target concept in thousands of bilingual dictionaries.

Cognate Prediction Machine Translation

An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages

no code implementations LREC 2020 Aaron Mueller, Garrett Nicolai, Arya D. McCarthy, Dylan Lewis, Winston Wu, David Yarowsky

We find that best practices in this domain are highly language-specific: adding more languages to a training set is often better, but too many harms performance; the best number depends on the source language.

Low-Resource Neural Machine Translation

Induced Inflection-Set Keyword Search in Speech

1 code implementation WS 2020 Oliver Adams, Matthew Wiesner, Jan Trmal, Garrett Nicolai, David Yarowsky

We investigate the problem of searching for a lexeme-set in speech by searching for its inflectional variants.

Modeling Color Terminology Across Thousands of Languages

1 code implementation IJCNLP 2019 Arya D. McCarthy, Winston Wu, Aaron Mueller, Bill Watson, David Yarowsky

There is an extensive history of scholarship into what constitutes a "basic" color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969).

The CoNLL--SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

no code implementations CONLL 2018 Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, Mans Hulden

Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a cloze task.

LEMMA Task 2

Marrying Universal Dependencies and Universal Morphology

no code implementations WS 2018 Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, David Yarowsky

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language.

Paradigm Completion for Derivational Morphology

no code implementations EMNLP 2017 Ryan Cotterell, Ekaterina Vylomova, Huda Khayrallah, Christo Kirov, David Yarowsky

The generation of complex derived word forms has been an overlooked problem in NLP; we fill this gap by applying neural sequence-to-sequence models to the task.

Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms

no code implementations LREC 2016 Christo Kirov, John Sylak-Glassman, Roger Que, David Yarowsky

Wiktionary is a large-scale resource for cross-lingual lexical information with great potential utility for machine translation (MT) and many other NLP tasks, especially automatic morphological analysis and generation.

Machine Translation Morphological Analysis

Remote Elicitation of Inflectional Paradigms to Seed Morphological Analysis in Low-Resource Languages

no code implementations LREC 2016 John Sylak-Glassman, Christo Kirov, David Yarowsky

We present methods inspired by linguistic fieldwork for gathering inflectional paradigm data in a machine-readable, interoperable format from remotely-located speakers of any language.

Morphological Analysis
