no code implementations • 16 Mar 2024 • Jonathan Dunn, Benjamin Adams, Harish Tayyar Madabushi
This paper measures the skew in how well two families of LLMs represent diverse geographic populations.
1 code implementation • 14 Mar 2024 • Jonathan Dunn, Lane Edwards-Brown
The result is a highly accurate model that covers 916 languages at a sample size of 50 characters, with performance improved by incorporating geographic information into the model.
no code implementations • 13 Mar 2024 • Jonathan Dunn
The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations.
no code implementations • 21 Sep 2023 • Jonathan Dunn
While language is a complex adaptive system, most work on syntactic variation observes a few individual constructions in isolation from the rest of the grammar.
no code implementations • 21 Aug 2023 • Sidney G.-J. Wong, Jonathan Dunn, Benjamin Adams
This paper describes a preliminary study on the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand (i.e., subnational administrative areas).
no code implementations • 20 Aug 2023 • Sidney G.-J. Wong, Matthew Durward, Benjamin Adams, Jonathan Dunn
We retrained a transformer-based cross-language pretrained language model, XLM-RoBERTa, with spatially and temporally relevant social media language data.
no code implementations • 27 Mar 2023 • Jonathan Dunn
This paper shows that differences in embeddings across varieties are significantly higher than baseline instability.
no code implementations • 30 Jan 2023 • Jonathan Dunn
Recent work has formulated the task for computational construction grammar as producing a constructicon given a corpus of usage.
no code implementations • 25 Nov 2022 • Jonathan Dunn
This paper uses computational experiments to explore the role of exposure in the emergence of construction grammars.
1 code implementation • 20 Sep 2022 • Haipeng Li, Jonathan Dunn, Andrea Nini
In this paper, the universality and robustness of register variation are tested by comparing variation within vs. between register-specific corpora in 60 languages using corpora produced in comparable communicative situations: tweets and Wikipedia articles.
no code implementations • COLING 2022 • Jonathan Dunn, Sidney Wong
And the distribution of classification accuracy within dialect regions allows us to identify the degree to which the grammar of a dialect is internally heterogeneous.
1 code implementation • 9 Jun 2022 • Haipeng Li, Jonathan Dunn
This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task.
1 code implementation • LREC 2022 • Jonathan Dunn, Wikke Nijhof
This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages.
1 code implementation • LREC 2022 • Jonathan Dunn, Haipeng Li, Damian Sastre
The goal is to use corpus similarity measures before training to predict properties of embeddings after training.
no code implementations • CoNLL (EMNLP) 2021 • Jonathan Dunn, Harish Tayyar Madabushi
These simulations are repeated with increasing amounts of exposure, from 100k to 2 million words, to measure the impact of exposure on the convergence of grammars.
no code implementations • NAACL (CMCL) 2021 • Jonathan Dunn, Andrea Nini
This paper asks whether a distinction between production-based and perception-based grammar induction influences either (i) the growth curve of grammars and lexicons or (ii) the similarity between representations learned from independent sub-sets of a corpus.
no code implementations • 3 Apr 2021 • Jonathan Dunn
Results show that the method (1) produces a grammar with stable quality across sub-sets of a single corpus that is (2) capable of distinguishing between regional varieties of English with a high degree of accuracy, thus (3) supporting dialectometric methods for measuring the similarity between varieties of English and (4) measuring the degree to which each construction is subject to regional variation.
1 code implementation • 3 Apr 2021 • Jonathan Dunn
This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation.
no code implementations • EMNLP (NLP+CSS) 2020 • Jonathan Dunn, Tom Coupe, Benjamin Adams
Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data.
1 code implementation • EACL (VarDial) 2021 • Jonathan Dunn
This paper measures similarity both within and between 84 language varieties across nine languages.
no code implementations • 3 Apr 2021 • Jonathan Dunn
The goal of this paper is to provide a complete representation of regional linguistic variation on a global scale.
no code implementations • LREC 2020 • Jonathan Dunn, Ben Adams
While text corpora have been steadily increasing in overall size, even very large corpora are not designed to represent global population demographics.
no code implementations • 2 Apr 2020 • Jonathan Dunn, Ben Adams
This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets.
no code implementations • 2 Apr 2020 • Jonathan Dunn
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
1 code implementation • WS 2018 • Jonathan Dunn
This paper uses the Minimum Description Length paradigm to model the complexity of CxGs (operationalized as the encoding size of a grammar) alongside their descriptive adequacy (operationalized as the encoding size of a corpus given a grammar).
no code implementations • WS 2019 • Jonathan Dunn
A usage-based Construction Grammar (CxG) posits that slot-constraints generalize from common exemplar constructions.
no code implementations • WS 2019 • Jonathan Dunn
This paper evaluates global-scale dialect identification for 14 national varieties of English as a means for studying syntactic variation.