no code implementations • loresmt (COLING) 2022 • Winston Wu, David Yarowsky
Translating into low-resource languages is challenging due to the scarcity of training data.
no code implementations • COLING 2022 • Georgie Botev, Arya D. McCarthy, Winston Wu, David Yarowsky
This paper presents a detailed foundational empirical case study of the nature of out-of-vocabulary words encountered in modern text in a moderate-resource language such as Bulgarian, and a multi-faceted distributional analysis of the underlying word-formation processes that can aid in their compositional translation, tagging, parsing, language modeling, and other NLP tasks.
no code implementations • LREC 2022 • Winston Wu, David Yarowsky
We evaluate two popular neural cognate generation models’ robustness to several types of human-plausible noise (deletion, duplication, swapping, and keyboard errors, as well as a new type of error, phonological errors).
1 code implementation • AMTA 2016 • John Hewitt, Matt Post, David Yarowsky
Statistical Machine Translation (SMT) of highly inflected, low-resource languages suffers from the problem of low bitext availability, which is exacerbated by large inflectional paradigms.
no code implementations • EMNLP 2020 • Arya D. McCarthy, Adina Williams, Shijia Liu, David Yarowsky, Ryan Cotterell
Of particular interest, languages on the same branch of our phylogenetic tree are notably similar, whereas languages from separate branches are no more similar than chance.
no code implementations • RANLP (BUCC) 2021 • Winston Wu, David Yarowsky
We constructed parsers for five non-English editions of Wiktionary, which combined with pronunciations from the English edition, comprises over 5. 3 million IPA pronunciations, the largest pronunciation lexicon of its kind.
no code implementations • ACL (SIGMORPHON) 2021 • Tiago Pimentel, Maria Ryskina, Sabrina J. Mielke, Shijie Wu, Eleanor Chodroff, Brian Leonard, Garrett Nicolai, Yustinus Ghanggo Ate, Salam Khalifa, Nizar Habash, Charbel El-Khaissi, Omer Goldman, Michael Gasser, William Lane, Matt Coler, Arturo Oncevay, Jaime Rafael Montoya Samame, Gema Celeste Silva Villegas, Adam Ek, Jean-Philippe Bernardy, Andrey Shcherbakov, Aziyana Bayyr-ool, Karina Sheifer, Sofya Ganieva, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Andrew Krizhanovsky, Natalia Krizhanovsky, Clara Vania, Sardana Ivanova, Aelita Salchak, Christopher Straughn, Zoey Liu, Jonathan North Washington, Duygu Ataman, Witold Kieraś, Marcin Woliński, Totok Suhardijanto, Niklas Stoehr, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Richard J. Hatcher, Emily Prud'hommeaux, Ritesh Kumar, Mans Hulden, Botond Barta, Dorina Lakatos, Gábor Szolnok, Judit Ács, Mohit Raj, David Yarowsky, Ryan Cotterell, Ben Ambridge, Ekaterina Vylomova
This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features.
no code implementations • LREC 2022 • Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova
The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.
no code implementations • COLING 2020 • Winston Wu, David Yarowsky
We extend the Yawipa Wiktionary Parser (Wu and Yarowsky, 2020) to extract and normalize translations from etymology glosses, and morphological form-of relations, resulting in 300K unique translations and over 4 million instances of 168 annotated morphological relations.
no code implementations • COLING 2020 • Dylan Lewis, Winston Wu, Arya D. McCarthy, David Yarowsky
We present a method for completing multilingual translation dictionaries.
no code implementations • LREC 2020 • Garrett Nicolai, Dylan Lewis, Arya D. McCarthy, Aaron Mueller, Winston Wu, David Yarowsky
Exploiting the broad translation of the Bible into the world{'}s languages, we train and distribute morphosyntactic tools for approximately one thousand languages, vastly outstripping previous distributions of tools devoted to the processing of inflectional morphology.
no code implementations • LREC 2020 • Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ern{\v{s}}treits, Yuval Pinter, Cass Jacobs, ra L., Ryan Cotterell, Mans Hulden, David Yarowsky
The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.
no code implementations • LREC 2020 • Aaron Mueller, Garrett Nicolai, Arya D. McCarthy, Dylan Lewis, Winston Wu, David Yarowsky
We find that best practices in this domain are highly language-specific: adding more languages to a training set is often better, but too many harms performance{---}the best number depends on the source language.
no code implementations • LREC 2020 • Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, David Yarowsky
The corpus consists of over 4000 unique translations of the Christian Bible and counting.
no code implementations • LREC 2020 • Winston Wu, David Yarowsky
We developed an extensible, comprehensive Wiktionary parser that improves over several existing parsers.
no code implementations • LREC 2020 • Winston Wu, Garrett Nicolai, David Yarowsky
We propose a new functional definition and construction method for core vocabulary sets for multiple applications based on the relative coverage of a target concept in thousands of bilingual dictionaries.
1 code implementation • WS 2020 • Oliver Adams, Matthew Wiesner, Jan Trmal, Garrett Nicolai, David Yarowsky
We investigate the problem of searching for a lexeme-set in speech by searching for its inflectional variants.
1 code implementation • IJCNLP 2019 • Arya D. McCarthy, Winston Wu, Aaron Mueller, Bill Watson, David Yarowsky
There is an extensive history of scholarship into what constitutes a "basic" color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969).
no code implementations • ACL 2019 • Garrett Nicolai, David Yarowsky
A large percentage of computational tools are concentrated in a very small subset of the planet{'}s languages.
no code implementations • NAACL 2019 • Oliver Adams, Matthew Wiesner, Shinji Watanabe, David Yarowsky
We report on adaptation of multilingual end-to-end speech recognition models trained on as many as 100 languages.
3 code implementations • LREC 2018 • Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, Mans Hulden
The Universal Morphology UniMorph project is a collaborative effort to improve how NLP handles complex morphology across the world's languages.
no code implementations • CONLL 2018 • Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, Mans Hulden
Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a cloze task.
no code implementations • WS 2018 • Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, David Yarowsky
The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language.
1 code implementation • IJCNLP 2017 • Patrick Xia, David Yarowsky
We present a method which generates a single consensus between multi-parallel corpora.
no code implementations • EMNLP 2017 • Ryan Cotterell, Ekaterina Vylomova, Huda Khayrallah, Christo Kirov, David Yarowsky
The generation of complex derived word forms has been an overlooked problem in NLP; we fill this gap by applying neural sequence-to-sequence models to the task.
no code implementations • CONLL 2017 • Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, Mans Hulden
In sub-task 2, systems were given a lemma and some of its specific inflected forms, and asked to complete the inflectional paradigm by predicting all of the remaining inflected forms.
no code implementations • LREC 2016 • John Sylak-Glassman, Christo Kirov, David Yarowsky
We present methods inspired by linguistic fieldwork for gathering inflectional paradigm data in a machine-readable, interoperable format from remotely-located speakers of any language.
no code implementations • LREC 2016 • Christo Kirov, John Sylak-Glassman, Roger Que, David Yarowsky
Wiktionary is a large-scale resource for cross-lingual lexical information with great potential utility for machine translation (MT) and many other NLP tasks, especially automatic morphological analysis and generation.
no code implementations • 5 Mar 2016 • Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, Ting Liu
Cross-lingual model transfer has been a promising approach for inducing dependency parsers for low-resource languages where annotated treebanks are not available.
Cross-lingual zero-shot dependency parsing
Representation Learning