Search Results for author: Jonathan Dunn

Found 30 papers, 8 papers with code

Pre-Trained Language Models Represent Some Geographic Populations Better Than Others

no code implementations • 16 Mar 2024 • Jonathan Dunn, Benjamin Adams, Harish Tayyar Madabushi

This paper measures the skew in how well two families of LLMs represent diverse geographic populations.

Geographically-Informed Language Identification

1 code implementation • 14 Mar 2024 • Jonathan Dunn, Lane Edwards-Brown

The result is a highly accurate model that covers 916 languages at a sample size of 50 characters, with performance improved by incorporating geographic information into the model.

Language Identification

Validating and Exploring Large Geographic Corpora

no code implementations • 13 Mar 2024 • Jonathan Dunn

The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations.

Language Identification, Outlier Detection

Syntactic Variation Across the Grammar: Modelling a Complex Adaptive System

no code implementations • 21 Sep 2023 • Jonathan Dunn

While language is a complex adaptive system, most work on syntactic variation observes a few individual constructions in isolation from the rest of the grammar.

Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas

no code implementations • 21 Aug 2023 • Sidney G.-J. Wong, Jonathan Dunn, Benjamin Adams

This paper describes a preliminary study on the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand (i.e., subnational administrative areas).

cantnlp@LT-EDI-2023: Homophobia/Transphobia Detection in Social Media Comments using Spatio-Temporally Retrained Language Models

no code implementations • 20 Aug 2023 • Sidney G.-J. Wong, Matthew Durward, Benjamin Adams, Jonathan Dunn

We retrained a transformer-based cross-language pretrained language model, XLM-RoBERTa, with spatially and temporally relevant social media language data.

Classification, Language Modelling

Variation and Instability in Dialect-Based Embedding Spaces

no code implementations • 27 Mar 2023 • Jonathan Dunn

This paper shows that differences in embeddings across varieties are significantly higher than baseline instability.

Exploring the Constructicon: Linguistic Analysis of a Computational CxG

no code implementations • 30 Jan 2023 • Jonathan Dunn

Recent work has formulated the task for computational construction grammar as producing a constructicon given a corpus of usage.

Exposure and Emergence in Usage-Based Grammar: Computational Experiments in 35 Languages

no code implementations • 25 Nov 2022 • Jonathan Dunn

This paper uses computational experiments to explore the role of exposure in the emergence of construction grammars.

Register Variation Remains Stable Across 60 Languages

1 code implementation • 20 Sep 2022 • Haipeng Li, Jonathan Dunn, Andrea Nini

In this paper, the universality and robustness of register variation are tested by comparing variation within vs. between register-specific corpora in 60 languages using corpora produced in comparable communicative situations: tweets and Wikipedia articles.

Stability of Syntactic Dialect Classification Over Space and Time

no code implementations • COLING 2022 • Jonathan Dunn, Sidney Wong

And the distribution of classification accuracy within dialect regions allows us to identify the degree to which the grammar of a dialect is internally heterogeneous.

Text Classification

Corpus Similarity Measures Remain Robust Across Diverse Languages

1 code implementation • 9 Jun 2022 • Haipeng Li, Jonathan Dunn

This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task.

Language Identification for Austronesian Languages

1 code implementation • LREC 2022 • Jonathan Dunn, Wikke Nijhof

This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages.

Language Identification

Predicting Embedding Reliability in Low-Resource Settings Using Corpus Similarity Measures

1 code implementation • LREC 2022 • Jonathan Dunn, Haipeng Li, Damian Sastre

The goal is to use corpus similarity measures before training to predict properties of embeddings after training.

Learned Construction Grammars Converge Across Registers Given Increased Exposure

no code implementations • CoNLL (EMNLP) 2021 • Jonathan Dunn, Harish Tayyar Madabushi

These simulations are repeated with increasing amounts of exposure, from 100k to 2 million words, to measure the impact of exposure on the convergence of grammars.

Production vs Perception: The Role of Individuality in Usage-Based Grammar Induction

no code implementations • NAACL (CMCL) 2021 • Jonathan Dunn, Andrea Nini

This paper asks whether a distinction between production-based and perception-based grammar induction influences either (i) the growth curve of grammars and lexicons or (ii) the similarity between representations learned from independent sub-sets of a corpus.

Finding Variants for Construction-Based Dialectometry: A Corpus-Based Approach to Regional CxGs

no code implementations • 3 Apr 2021 • Jonathan Dunn

Results show that the method (1) produces a grammar with stable quality across sub-sets of a single corpus that is (2) capable of distinguishing between regional varieties of English with a high degree of accuracy, thus (3) supporting dialectometric methods for measuring the similarity between varieties of English and (4) measuring the degree to which each construction is subject to regional variation.

Multi-Unit Directional Measures of Association: Moving Beyond Pairs of Words

1 code implementation • 3 Apr 2021 • Jonathan Dunn

This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation.

Segmentation

Measuring Linguistic Diversity During COVID-19

no code implementations • EMNLP (NLP+CSS) 2020 • Jonathan Dunn, Tom Coupe, Benjamin Adams

Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data.

Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

1 code implementation • EACL (VarDial) 2021 • Jonathan Dunn

This paper measures similarity both within and between 84 language varieties across nine languages.

Global Syntactic Variation in Seven Languages: Towards a Computational Dialectology

no code implementations • 3 Apr 2021 • Jonathan Dunn

The goal of this paper is to provide a complete representation of regional linguistic variation on a global scale.

Geographically-Balanced Gigaword Corpora for 50 Language Varieties

no code implementations • LREC 2020 • Jonathan Dunn, Ben Adams

While text corpora have been steadily increasing in overall size, even very large corpora are not designed to represent global population demographics.

Word Embeddings

Mapping Languages and Demographics with Georeferenced Corpora

no code implementations • 2 Apr 2020 • Jonathan Dunn, Ben Adams

This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets.

Mapping Languages: The Corpus of Global Language Use

no code implementations • 2 Apr 2020 • Jonathan Dunn

This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.

Language Identification

Modeling the Complexity and Descriptive Adequacy of Construction Grammars

1 code implementation • WS 2018 • Jonathan Dunn

This paper uses the Minimum Description Length paradigm to model the complexity of CxGs (operationalized as the encoding size of a grammar) alongside their descriptive adequacy (operationalized as the encoding size of a corpus given a grammar).

Descriptive