Search Results for author: Brian Roark

Found 39 papers, 8 papers with code

Criteria for Useful Automatic Romanization in South Asian Languages

no code implementations LREC 2022 Isin Demirsahin, Cibu Johny, Alexander Gutkin, Brian Roark

This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script, a process known as romanization.

Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities

no code implementations LREC 2022 Alexander Gutkin, Cibu Johny, Raiomond Doctor, Lawrence Wolf-Sonkin, Brian Roark

The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems.


Design principles of an open-source language modeling microservice package for AAC text-entry applications

no code implementations SLPAT (ACL) 2022 Brian Roark, Alexander Gutkin

We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library.

Language Modelling

Spelling convention sensitivity in neural language models

no code implementations 6 Mar 2023 Elizabeth Nielsen, Christo Kirov, Brian Roark

Using a set of probe words unique to either British or American English, we first establish that training corpora exhibit substantial (though not total) consistency.

Language Modelling

Beyond Arabic: Software for Perso-Arabic Script Manipulation

1 code implementation 26 Jan 2023 Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat

This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.


Graphemic Normalization of the Perso-Arabic Script

1 code implementation 21 Oct 2022 Raiomond Doctor, Alexander Gutkin, Cibu Johny, Brian Roark, Richard Sproat

Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions.

Language Modelling Machine Translation
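
The normalization problem here involves visually identical but differently encoded Perso-Arabic characters. As an illustrative sketch (not the paper's actual method, which handles regional conventions beyond standard Unicode equivalences), Python's built-in Unicode normalization already merges some such variants:

```python
import unicodedata

# U+0622 (ARABIC LETTER ALEF WITH MADDA ABOVE) can also be encoded as the
# two-codepoint sequence U+0627 (ALEF) + U+0653 (MADDAH ABOVE); both render
# identically but compare unequal as raw strings.
composed = "\u0622"
decomposed = "\u0627\u0653"

# NFC canonical composition maps the decomposed spelling to the single codepoint.
assert unicodedata.normalize("NFC", decomposed) == composed

# NFKC additionally folds compatibility characters, e.g. the Arabic
# presentation form U+FE8D (ALEF ISOLATED FORM) back to plain U+0627.
assert unicodedata.normalize("NFKC", "\ufe8d") == "\u0627"
```

Standard NFC/NFKC only covers equivalences encoded in Unicode itself; the graphemic normalization the paper studies goes further, which is why a dedicated library is needed.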

Structured abbreviation expansion in context

no code implementations Findings (EMNLP) 2021 Kyle Gorman, Christo Kirov, Brian Roark, Richard Sproat

Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages.

Spelling Correction

Finding Concept-specific Biases in Form-Meaning Associations

2 code implementations NAACL 2021 Tiago Pimentel, Brian Roark, Søren Wichmann, Ryan Cotterell, Damián Blasi

It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words.

Disambiguatory Signals are Stronger in Word-initial Positions

1 code implementation EACL 2021 Tiago Pimentel, Ryan Cotterell, Brian Roark

Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower).


Phonotactic Complexity and its Trade-offs

1 code implementation TACL 2020 Tiago Pimentel, Brian Roark, Ryan Cotterell

We present methods for calculating a measure of phonotactic complexity (bits per phoneme) that permits a straightforward cross-linguistic comparison.
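
The bits-per-phoneme measure is the average surprisal per phoneme under a phone-level language model. The paper fits proper phone-level models; the unigram estimate below is only a toy stand-in to make the quantity concrete:

```python
import math
from collections import Counter

def bits_per_phoneme(corpus):
    """Average negative log2 probability per phoneme under a unigram
    phone model (an illustrative simplification of the paper's measure)."""
    counts = Counter(p for word in corpus for p in word)
    total = sum(counts.values())
    probs = {p: c / total for p, c in counts.items()}
    surprisal = -sum(math.log2(probs[p]) for word in corpus for p in word)
    return surprisal / total

# Hypothetical three-word corpus over the phones {k, a, t}.
corpus = [["k", "a", "t"], ["t", "a", "k"], ["a", "k", "a"]]
print(bits_per_phoneme(corpus))
```

A lower value indicates more predictable phonotactics; the paper's trade-off result compares this quantity against word length across languages.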

Language-agnostic Multilingual Modeling

no code implementations20 Apr 2020 Arindrima Datta, Bhuvana Ramabhadran, Jesse Emond, Anjuli Kannan, Brian Roark

Multilingual Automated Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model.

Speech Recognition +1

Distilling weighted finite automata from arbitrary probabilistic models

no code implementations WS 2019 Ananda Theertha Suresh, Brian Roark, Michael Riley, Vlad Schogol

Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space.

Latin script keyboards for South Asian languages with finite-state normalization

no code implementations WS 2019 Lawrence Wolf-Sonkin, Vlad Schogol, Brian Roark, Michael Riley

The use of the Latin script for text entry of South Asian languages is common, even though there is no standard orthography for these languages in the script.


Rethinking Phonotactic Complexity

no code implementations WS 2019 Tiago Pimentel, Brian Roark, Ryan Cotterell

In this work, we propose the use of phone-level language models to estimate phonotactic complexity, measured in bits per phoneme, which makes cross-linguistic comparison straightforward.

Meaning to Form: Measuring Systematicity as Information

1 code implementation ACL 2019 Tiago Pimentel, Arya D. McCarthy, Damián E. Blasi, Brian Roark, Ryan Cotterell

A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade?

What Kind of Language Is Hard to Language-Model?

no code implementations ACL 2019 Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, Jason Eisner

Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.

Language Modelling Sentence

Neural Models of Text Normalization for Speech Applications

no code implementations CL 2019 Hao Zhang, Richard Sproat, Axel H. Ng, Felix Stahlberg, Xiaochang Peng, Kyle Gorman, Brian Roark

One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS).

BIG-bench Machine Learning Speech Synthesis +1

Approximating probabilistic models as weighted finite automata

no code implementations CL (ACL) 2021 Ananda Theertha Suresh, Brian Roark, Michael Riley, Vlad Schogol

Weighted finite automata (WFA) are often used to represent probabilistic models, such as $n$-gram language models, since they are efficient for recognition tasks in time and space.
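
The efficiency claim can be made concrete with a toy bigram automaton (a sketch of WFA scoring only, not the approximation algorithm the paper develops): states are conditioning histories, arcs carry negative log probabilities, and scoring a string is a single pass with one arc lookup per symbol:

```python
import math

# Hypothetical bigram "WFA": each state is the previous symbol, and each
# arc weight is the conditional probability of the next symbol.
arcs = {
    "<s>": {"a": 0.7, "b": 0.3},
    "a":   {"a": 0.1, "b": 0.5, "</s>": 0.4},
    "b":   {"a": 0.6, "b": 0.2, "</s>": 0.2},
}

def score(seq):
    """Total path weight (negative log probability) of seq, O(len(seq))."""
    state, weight = "<s>", 0.0
    for sym in seq:
        weight += -math.log(arcs[state][sym])
        state = sym
    return weight + -math.log(arcs[state]["</s>"])

print(score(["a", "b"]))
```

Because recognition cost is linear in the input length regardless of how the weights were obtained, distilling a neural model's distribution into such an automaton trades some fidelity for this fixed, fast inference.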

Are All Languages Equally Hard to Language-Model?

no code implementations NAACL 2018 Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, Brian Roark

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles?

Language Modelling
