Inducing a lexicon of sociolinguistic variables from code-mixed text

Sociolinguistics is often concerned with how variants of a linguistic item (e.g., nothing vs. nothin') are used by different groups or in different situations. We introduce the task of inducing lexical variables from code-mixed text: that is, identifying equivalence pairs such as (football, fitba) along with their linguistic code (football→British, fitba→Scottish). We adapt a framework for identifying gender-biased word pairs to this new task, and present results on three different pairs of English dialects, using tweets as the code-mixed text. Our system achieves precision of over 70% for two of these three datasets, and produces useful results even without extensive parameter tuning. Our success in adapting this framework from gender to language variety suggests that it could be used to discover other types of analogous pairs as well.
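To make the task output and the reported metric concrete, the following is a minimal sketch in Python, not the authors' implementation. The `LexicalVariable` class, the `precision` helper, and the second predicted pair are hypothetical and introduced only for illustration; the only pair taken from the abstract is (football, fitba).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalVariable:
    """One induced equivalence pair, each variant tagged with its code (hypothetical type)."""
    variant_a: str
    code_a: str
    variant_b: str
    code_b: str


def precision(predicted, gold):
    """Fraction of predicted pairs that also appear in the gold set."""
    if not predicted:
        return 0.0
    correct = sum(1 for pair in predicted if pair in gold)
    return correct / len(predicted)


# Illustrative system output: the first pair is the abstract's example,
# the second is an invented error used only to exercise the metric.
predicted = [
    LexicalVariable("football", "British", "fitba", "Scottish"),
    LexicalVariable("football", "British", "rugby", "Scottish"),
]
gold = {LexicalVariable("football", "British", "fitba", "Scottish")}

score = precision(predicted, gold)
print(score)
```

Under this reading, a precision of over 70% would mean that more than seven in ten of the pairs the system proposes are judged to be genuine equivalence pairs with correctly assigned codes.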
