no code implementations • LREC 2022 • Xinjian Li, Florian Metze, David R. Mortensen, Alan W Black, Shinji Watanabe
Identifying phone inventories is a crucial component in language documentation and the preservation of endangered languages.
no code implementations • LREC 2022 • David R. Mortensen, Xinyu Zhang, Chenxuan Cui, Katherine Zhang
This paper describes the first publicly available corpus of Hmong, a minority language of China, Vietnam, Laos, Thailand, and various countries in Europe and the Americas.
2 code implementations • COLING 2022 • Kalvin Chang, Chenxuan Cui, Youngmin Kim, David R. Mortensen
Most comparative datasets of Chinese varieties are not digital; however, Wiktionary includes a wealth of transcriptions of words from these varieties.
no code implementations • 10 Mar 2025 • Jimin Sohn, David R. Mortensen
Existing approaches to zero-shot Named Entity Recognition (NER) for low-resource languages have primarily relied on machine translation, whereas more recent methods have shifted focus to phonemic representation.
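One way to obtain such phonemic representations is a grapheme-to-phoneme tool. As a minimal sketch (assuming the pip-installable epitran library, which is not necessarily the tool used in this paper), Spanish text can be mapped to IPA like this:

```python
# pip install epitran
import epitran

# Grapheme-to-phoneme conversion for Spanish written in Latin script
# (Spanish has rule-based support and needs no extra dependencies).
epi = epitran.Epitran("spa-Latn")
print(epi.transliterate("paloma blanca"))  # IPA string for the input text
```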
no code implementations • 27 Jan 2025 • Niyati Bafna, Emily Chang, Nathaniel R. Robinson, David R. Mortensen, Kenton Murray, David Yarowsky, Hale Sirin
Most of the world's languages and dialects are low-resource, and lack support in mainstream machine translation (MT) models.
1 code implementation • 26 Aug 2024 • Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen
Underperformance of ASR systems for speakers of African American Vernacular English (AAVE) and other marginalized language varieties is a well-documented phenomenon, and one that reinforces the stigmatization of these varieties.
no code implementations • 24 Jun 2024 • Jimin Sohn, Jeihee Cho, Junyong Lee, Songmu Heo, Ji-Eun Han, David R. Mortensen
Positive thinking is thought to be an important component of self-motivation in various practical fields such as education and the workplace.
1 code implementation • 23 Jun 2024 • Jimin Sohn, Haeji Jung, Alex Cheng, Jooeon Kang, Yilin Du, David R. Mortensen
Existing zero-shot cross-lingual NER approaches require substantial prior knowledge of the target language, which is impractical for low-resource languages.
1 code implementation • 9 Jun 2024 • Liang Lu, Peirong Xie, David R. Mortensen
We propose a semisupervised historical reconstruction task in which the model is trained on only a small amount of labeled data (cognate sets with proto-forms) and a large amount of unlabeled data (cognate sets without proto-forms).
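As a rough illustration of the setup (invented data and placeholder structures, not the paper's actual format or model), the labeled/unlabeled split might be organized like this:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CognateSet:
    """A set of cognate reflexes, optionally paired with a known proto-form."""
    reflexes: List[str]                 # attested daughter-language forms
    proto_form: Optional[str] = None    # reconstruction, if available

# Hypothetical data: a small labeled portion and a larger unlabeled portion.
labeled = [CognateSet(["pater", "pitar", "fadar"], proto_form="*ph2ter")]
unlabeled = [CognateSet(["mater", "matar", "modor"])]  # proto-form withheld

def training_examples(labeled, unlabeled):
    """Yield supervised pairs from labeled sets and inputs-only examples
    from unlabeled sets, to be combined in a semisupervised objective."""
    for cs in labeled:
        yield cs.reflexes, cs.proto_form   # supervised signal
    for cs in unlabeled:
        yield cs.reflexes, None            # unsupervised signal only
```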
no code implementations • 24 Apr 2024 • Chenxuan Cui, Ying Chen, Qinxin Wang, David R. Mortensen
Proto-form reconstruction has been a painstaking process for linguists.
1 code implementation • 27 Mar 2024 • Liang Lu, Jingzhi Wang, David R. Mortensen
Protolanguage reconstruction is central to historical linguistics.
no code implementations • 26 Mar 2024 • David R. Mortensen, Valentina Izrailevitch, Yunze Xiao, Hinrich Schütze, Leonie Weissweiler
We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open-source language models are also able to perform it, and that the 7B-parameter Mistral shows as small a gap between its baseline performance on the natural language inference task and its performance on the non-prototypical syntactic category task as the massive GPT-4 does.
1 code implementation • 26 Mar 2024 • Shijia Zhou, Leonie Weissweiler, Taiqi He, Hinrich Schütze, David R. Mortensen, Lori Levin
In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias.
1 code implementation • 19 Mar 2024 • Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel R. Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R. Mortensen, Lori Levin
Thousands of the world's languages are in danger of extinction, a tremendous threat to cultural identities and human language diversity.
no code implementations • 22 Feb 2024 • Haeji Jung, Changdae Oh, Jooeon Kang, Jimin Sohn, Kyungwoo Song, Jinkyu Kim, David R. Mortensen
Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages.
1 code implementation • 20 Feb 2024 • Ryan Soh-Eun Shim, Kalvin Chang, David R. Mortensen
Received wisdom in linguistic typology holds that if the structure of a language becomes more complex in one dimension, it will simplify in another, building on the assumption that all languages are equally complex (Joseph and Newmeyer, 2012).
1 code implementation • 2 Feb 2024 • Kalvin Chang, Nathaniel R. Robinson, Anna Cai, Ting Chen, Annie Zhang, David R. Mortensen
We describe a set of new methods to partially automate linguistic phylogenetic inference given (1) cognate sets with their respective protoforms and sound laws, (2) a mapping from phones to their articulatory features and (3) a typological database of sound changes.
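Schematically, the three inputs might look like the following (field names and values are illustrative placeholders, not the paper's actual schema):

```python
# (1) Cognate sets with proto-forms and the sound laws deriving each reflex.
cognate_sets = [
    {
        "proto_form": "*mra",
        "reflexes": {"LangA": "ma", "LangB": "mja"},
        "sound_laws": {"LangA": ["*r > 0 / C_"], "LangB": ["*r > j / C_"]},
    },
]

# (2) A mapping from phones to articulatory feature values (placeholder values).
phone_features = {
    "m": {"nasal": +1, "labial": +1, "voice": +1},
    "r": {"nasal": -1, "labial": -1, "voice": +1},
}

# (3) A typological database of attested sound changes.
sound_change_db = [
    {"change": "*r > j", "environment": "after consonants", "attested_in": ["LangB"]},
]
```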
no code implementations • 23 Oct 2023 • Leonie Weissweiler, Valentin Hofmann, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schütze, Kemal Oflazer, David R. Mortensen
Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills.
1 code implementation • 14 Sep 2023 • Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, Graham Neubig
Without published experimental evidence on the matter, it is difficult for speakers of the world's diverse languages to know how and whether they can use LLMs for their languages.
no code implementations • 23 May 2023 • Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov
Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products.
no code implementations • 4 Feb 2023 • Leonie Weissweiler, Taiqi He, Naoki Otani, David R. Mortensen, Lori Levin, Hinrich Schütze
Construction Grammar (CxG) has recently been used as the basis for probing studies that have investigated the performance of large pretrained language models (PLMs) with respect to the structure and meaning of constructions.
no code implementations • loresmt (COLING) 2022 • Nathaniel R. Robinson, Cameron J. Hogan, Nancy Fulda, David R. Mortensen
Our experiments suggest that, for some languages, back-translation augmentation becomes counterproductive beyond a threshold of authentic data, while cross-lingual transfer from a sufficiently related language is preferred.
1 code implementation • 20 Jul 2022 • Nathaniel Robinson, Perez Ogayo, Swetha Gangu, David R. Mortensen, Shinji Watanabe
Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data.
no code implementations • NAACL 2022 • Chenxuan Cui, Katherine J. Zhang, David R. Mortensen
Mortensen (2006) claims that (1) the linear ordering of EEs and CCs in Hmong, Lahu, and Chinese can be predicted via phonological hierarchies and (2) these phonological hierarchies lack a clear phonetic rationale.
1 code implementation • 24 Jul 2021 • Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe
These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages.
no code implementations • 2 Apr 2021 • David R. Mortensen, Jordan Picone, Xinjian Li, Kathleen Siminyu
There is additionally interest in building language technologies for low-resource and endangered languages.
1 code implementation • EMNLP 2021 • Adithya Pratapa, Antonios Anastasopoulos, Shruti Rijhwani, Aditi Chaudhary, David R. Mortensen, Graham Neubig, Yulia Tsvetkov
Text generation systems are ubiquitous in natural language processing applications.
1 code implementation • EMNLP 2020 • Aditi Chaudhary, Antonios Anastasopoulos, Adithya Pratapa, David R. Mortensen, Zaid Sheikh, Yulia Tsvetkov, Graham Neubig
Using cross-lingual transfer, even with no expert annotations in the language of interest, our framework extracts a grammatical specification which is nearly equivalent to those created with large amounts of gold-standard annotated data.
2 code implementations • EACL 2021 • Jimin Sun, Hwijeen Ahn, Chan Young Park, Yulia Tsvetkov, David R. Mortensen
Much work in cross-lingual transfer learning explored how to select better transfer languages for multilingual tasks, primarily focusing on typological and genealogical similarities between languages.
no code implementations • 8 Jun 2020 • Shahan Ali Memon, Aman Tyagi, David R. Mortensen, Kathleen M. Carley
For effective health communication, it is imperative to focus on "preference-based framing", where the preferences of the target sub-community are taken into consideration.
no code implementations • LREC 2020 • Clayton Marr, David R. Mortensen
Traditionally, historical phonologists have relied on tedious manual derivations to calibrate the sequences of sound changes that shaped the phonological evolution of languages.
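As a toy sketch of the kind of derivation being automated (the rules and forms below are invented for illustration and do not come from the paper), an ordered cascade of rewrite rules can be applied to a proto-form while recording each intermediate stage:

```python
import re

# Ordered cascade of (pattern, replacement) sound-change rules (invented examples).
RULES = [
    (r"p", "f"),     # *p > f
    (r"t$", "s"),    # *t > s word-finally
]

def derive(proto_form: str) -> list:
    """Apply each rule in order, recording every intermediate stage."""
    stages = [proto_form]
    form = proto_form
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
        stages.append(form)
    return stages

print(derive("pat"))  # ['pat', 'fat', 'fas']
```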
no code implementations • LREC 2020 • David R. Mortensen, Xinjian Li, Patrick Littell, Alexis Michaud, Shruti Rijhwani, Antonios Anastasopoulos, Alan W. Black, Florian Metze, Graham Neubig
While phonemic representations are language-specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription.
1 code implementation • 26 Feb 2020 • Xinjian Li, Siddharth Dalmia, Juncheng Li, Matthew Lee, Patrick Littell, Jiali Yao, Antonios Anastasopoulos, David R. Mortensen, Graham Neubig, Alan W. Black, Florian Metze
Multilingual models can improve language processing, particularly for low resource situations, by sharing parameters across languages.
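A universal phone recognizer in this vein, Allosaurus, is available as a pip package; assuming its documented read_recognizer interface, usage looks roughly like this (the audio path is a placeholder):

```python
# pip install allosaurus
from allosaurus.app import read_recognizer

# Load the pretrained universal phone recognizer (downloads a model on first use).
model = read_recognizer()

# Transcribe an audio file into a sequence of IPA phones (path is a placeholder).
print(model.recognize("sample.wav"))
```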
no code implementations • 26 Feb 2020 • Xinjian Li, Siddharth Dalmia, David R. Mortensen, Juncheng Li, Alan W. Black, Florian Metze
The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes.
1 code implementation • SCiL 2020 • Maria Ryskina, Ella Rabinovich, Taylor Berg-Kirkpatrick, David R. Mortensen, Yulia Tsvetkov
Besides presenting a new linguistic application of distributional semantics, this study tackles the linguistic question of the role of language-internal factors (in our case, sparsity) in language change motivated by language-external factors (reflected in frequency growth).
no code implementations • 7 Nov 2019 • Zhong Zhou, Lori Levin, David R. Mortensen, Alex Waibel
Firstly, we pool IGT for 1,497 languages in ODIN (54,545 glosses) and 70,918 glosses in Arapaho and train a gloss-to-target NMT system from IGT to English, with a BLEU score of 25.94.
1 code implementation • WS 2019 • Aditi Chaudhary, Elizabeth Salesky, Gayatri Bhat, David R. Mortensen, Jaime G. Carbonell, Yulia Tsvetkov
This paper presents the submission by the CMU-01 team to the SIGMORPHON 2019 task 2 of Morphological Analysis and Lemmatization in Context.
no code implementations • 24 Feb 2019 • Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian Li, Austin Matthews, Aldrian Obaja Muis, Naoki Otani, Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas, Xinyi Wang, Jiateng Xie, Ruochen Xu, Chunting Zhou, Peter J. Jansen, Yiming Yang, Lori Levin, Florian Metze, Teruko Mitamura, David R. Mortensen, Graham Neubig, Eduard Hovy, Alan W. Black, Jaime Carbonell, Graham V. Horwood, Shabnam Tafreshi, Mona Diab, Efsun S. Kayi, Noura Farra, Kathleen McKeown
This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech).
no code implementations • 27 Sep 2018 • Xinjian Li, Siddharth Dalmia, David R. Mortensen, Florian Metze, Alan W Black
Our model is able to recognize unseen phonemes in the target language, if only a small text corpus is available.
1 code implementation • EMNLP 2018 • Aditi Chaudhary, Chunting Zhou, Lori Levin, Graham Neubig, David R. Mortensen, Jaime G. Carbonell
Much work in Natural Language Processing (NLP) has been for resource-rich languages, making generalization to new, less-resourced languages challenging.
no code implementations • EACL 2017 • Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, Lori Levin
We introduce the URIEL knowledge base for massively multilingual NLP and the lang2vec utility, which provides information-rich vector identifications of languages drawn from typological, geographical, and phylogenetic databases and normalized to have straightforward and consistent formats, naming, and semantics.
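Assuming the pip-installable lang2vec package that accompanies URIEL, querying typological feature vectors looks roughly like this (languages are given as ISO 639-3 codes; "syntax_knn" is one of several available feature sets):

```python
# pip install lang2vec
import lang2vec.lang2vec as l2v

# KNN-imputed syntactic features for English and Hmong (ISO 639-3 codes).
features = l2v.get_features(["eng", "hmn"], "syntax_knn")

print(len(features["eng"]))   # dimensionality of the feature vector
print(features["eng"][:10])   # first few feature values
```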
1 code implementation • COLING 2016 • David R. Mortensen, Patrick Littell, Akash Bharadwaj, Kartik Goyal, Chris Dyer, Lori Levin
This paper contributes to a growing body of evidence that, when coupled with appropriate machine-learning techniques, linguistically motivated, information-rich representations can outperform one-hot encodings of linguistic data.
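The representations in question are articulatory feature vectors of the kind provided by the PanPhon library from the same authors; assuming its FeatureTable interface, a short sketch of mapping an IPA string to feature vectors instead of one-hot symbols:

```python
# pip install panphon
import panphon

ft = panphon.FeatureTable()

# One articulatory feature vector per IPA segment, with values in {-1, 0, +1}.
vectors = ft.word_to_vector_list("pʰiɡ", numeric=True)
for vec in vectors:
    print(vec)
```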
no code implementations • COLING 2016 • Patrick Littell, Kartik Goyal, David R. Mortensen, Alexa Little, Chris Dyer, Lori Levin
This paper describes our construction of named-entity recognition (NER) systems in two Western Iranian languages, Sorani Kurdish and Tajik, as a part of a pilot study of "Linguistic Rapid Response" to potential emergency humanitarian relief situations.
no code implementations • LREC 2016 • Patrick Littell, David R. Mortensen, Kartik Goyal, Chris Dyer, Lori Levin
In Sorani Kurdish, one of the most useful orthographic features in named-entity recognition, capitalization, is absent, as the language's Perso-Arabic script does not make a distinction between uppercase and lowercase letters.