Search Results for author: Jonathan Dunn

Found 30 papers, 8 papers with code

Pre-Trained Language Models Represent Some Geographic Populations Better Than Others

no code implementations 16 Mar 2024 Jonathan Dunn, Benjamin Adams, Harish Tayyar Madabushi

This paper measures the skew in how well two families of LLMs represent diverse geographic populations.

Geographically-Informed Language Identification

1 code implementation 14 Mar 2024 Jonathan Dunn, Lane Edwards-Brown

The result is a highly accurate model that covers 916 languages at a sample size of 50 characters, with performance improved by incorporating geographic information into the model.

Language Identification

Validating and Exploring Large Geographic Corpora

no code implementations 13 Mar 2024 Jonathan Dunn

The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations.

Language Identification Outlier Detection

Syntactic Variation Across the Grammar: Modelling a Complex Adaptive System

no code implementations 21 Sep 2023 Jonathan Dunn

While language is a complex adaptive system, most work on syntactic variation observes a few individual constructions in isolation from the rest of the grammar.

Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas

no code implementations 21 Aug 2023 Sidney G.-J. Wong, Jonathan Dunn, Benjamin Adams

This paper describes a preliminary study on the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand (i.e., subnational administrative areas).

cantnlp@LT-EDI-2023: Homophobia/Transphobia Detection in Social Media Comments using Spatio-Temporally Retrained Language Models

no code implementations 20 Aug 2023 Sidney G.-J. Wong, Matthew Durward, Benjamin Adams, Jonathan Dunn

We retrained a transformer-based cross-language pretrained language model, XLM-RoBERTa, with spatially and temporally relevant social media language data.

Classification Language Modelling

Variation and Instability in Dialect-Based Embedding Spaces

no code implementations 27 Mar 2023 Jonathan Dunn

This paper shows that differences in embeddings across varieties are significantly higher than baseline instability.

Exploring the Constructicon: Linguistic Analysis of a Computational CxG

no code implementations 30 Jan 2023 Jonathan Dunn

Recent work has formulated the task for computational construction grammar as producing a constructicon given a corpus of usage.

Exposure and Emergence in Usage-Based Grammar: Computational Experiments in 35 Languages

no code implementations 25 Nov 2022 Jonathan Dunn

This paper uses computational experiments to explore the role of exposure in the emergence of construction grammars.

Register Variation Remains Stable Across 60 Languages

1 code implementation 20 Sep 2022 Haipeng Li, Jonathan Dunn, Andrea Nini

In this paper, the universality and robustness of register variation is tested by comparing variation within vs. between register-specific corpora in 60 languages using corpora produced in comparable communicative situations: tweets and Wikipedia articles.

Stability of Syntactic Dialect Classification Over Space and Time

no code implementations COLING 2022 Jonathan Dunn, Sidney Wong

The distribution of classification accuracy within dialect regions allows us to identify the degree to which the grammar of a dialect is internally heterogeneous.

Text Classification

Corpus Similarity Measures Remain Robust Across Diverse Languages

1 code implementation 9 Jun 2022 Haipeng Li, Jonathan Dunn

This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task.

Language Identification for Austronesian Languages

1 code implementation LREC 2022 Jonathan Dunn, Wikke Nijhof

This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages.

Language Identification

Predicting Embedding Reliability in Low-Resource Settings Using Corpus Similarity Measures

1 code implementation LREC 2022 Jonathan Dunn, Haipeng Li, Damian Sastre

The goal is to use corpus similarity measures before training to predict properties of embeddings after training.

Learned Construction Grammars Converge Across Registers Given Increased Exposure

no code implementations CoNLL (EMNLP) 2021 Jonathan Dunn, Harish Tayyar Madabushi

These simulations are repeated with increasing amounts of exposure, from 100k to 2 million words, to measure the impact of exposure on the convergence of grammars.

Production vs Perception: The Role of Individuality in Usage-Based Grammar Induction

no code implementations NAACL (CMCL) 2021 Jonathan Dunn, Andrea Nini

This paper asks whether a distinction between production-based and perception-based grammar induction influences either (i) the growth curve of grammars and lexicons or (ii) the similarity between representations learned from independent sub-sets of a corpus.

Finding Variants for Construction-Based Dialectometry: A Corpus-Based Approach to Regional CxGs

no code implementations 3 Apr 2021 Jonathan Dunn

Results show that the method (1) produces a grammar with stable quality across sub-sets of a single corpus that is (2) capable of distinguishing between regional varieties of English with a high degree of accuracy, thus (3) supporting dialectometric methods for measuring the similarity between varieties of English and (4) measuring the degree to which each construction is subject to regional variation.

Multi-Unit Directional Measures of Association: Moving Beyond Pairs of Words

1 code implementation 3 Apr 2021 Jonathan Dunn

This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation.

Segmentation

Measuring Linguistic Diversity During COVID-19

no code implementations EMNLP (NLP+CSS) 2020 Jonathan Dunn, Tom Coupe, Benjamin Adams

Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data.

Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

1 code implementation EACL (VarDial) 2021 Jonathan Dunn

This paper measures similarity both within and between 84 language varieties across nine languages.

Global Syntactic Variation in Seven Languages: Towards a Computational Dialectology

no code implementations 3 Apr 2021 Jonathan Dunn

The goal of this paper is to provide a complete representation of regional linguistic variation on a global scale.

Geographically-Balanced Gigaword Corpora for 50 Language Varieties

no code implementations LREC 2020 Jonathan Dunn, Ben Adams

While text corpora have been steadily increasing in overall size, even very large corpora are not designed to represent global population demographics.

Word Embeddings

Mapping Languages and Demographics with Georeferenced Corpora

no code implementations 2 Apr 2020 Jonathan Dunn, Ben Adams

This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets.

Mapping Languages: The Corpus of Global Language Use

no code implementations 2 Apr 2020 Jonathan Dunn

This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.

Language Identification

Modeling the Complexity and Descriptive Adequacy of Construction Grammars

1 code implementation WS 2018 Jonathan Dunn

This paper uses the Minimum Description Length paradigm to model the complexity of CxGs (operationalized as the encoding size of a grammar) alongside their descriptive adequacy (operationalized as the encoding size of a corpus given a grammar).

Descriptive

Frequency vs. Association for Constraint Selection in Usage-Based Construction Grammar

no code implementations WS 2019 Jonathan Dunn

A usage-based Construction Grammar (CxG) posits that slot-constraints generalize from common exemplar constructions.

Modeling Global Syntactic Variation in English Using Dialect Classification

no code implementations WS 2019 Jonathan Dunn

This paper evaluates global-scale dialect identification for 14 national varieties of English as a means for studying syntactic variation.

Classification Dialect Identification
