A Dataset and Classifier for Recognizing Social Media English

WS 2017  ·  Su Lin Blodgett, Johnny Wei, Brendan O'Connor

While language identification works well on standard texts, it performs much worse on social media language, in particular dialectal language, even for English. First, to support work on English language identification, we contribute a new dataset of tweets annotated for English versus non-English, with attention to ambiguity, code-switching, and automatic generation issues. It is randomly sampled from all public messages, avoiding biases towards pre-existing language classifiers. Second, we find that a demographic language model, which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter, can be used to improve English language identification performance when combined with a traditional supervised language identifier. It increases recall with almost no loss of precision, including, surprisingly, for English messages written by non-U.S. authors. Our dataset and identifier ensemble are available online.
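
The abstract does not spell out how the two models are combined, so the following is only a minimal sketch of one plausible ensemble rule: accept a message as English if the supervised identifier says so, or if the demographic model's score clears a threshold. The field names, the `threshold` value, and the toy records are all hypothetical and are not taken from the paper or its dataset; the point is only to show how such an OR-style combination can raise recall while leaving precision unchanged.

```python
from typing import List, Tuple

# Each toy record: (langid_english, demographic_score, gold_english).
# All values are invented for illustration; they are NOT from the paper's data.
#   langid_english    - hypothetical output of a supervised language identifier
#   demographic_score - hypothetical demographic-model score in [0, 1]
#   gold_english      - hypothetical human annotation
Record = Tuple[bool, float, bool]

TOY_DATA: List[Record] = [
    (True, 0.10, True),    # standard English, caught by the supervised model
    (False, 0.95, True),   # dialectal English, missed by the supervised model
    (False, 0.05, False),  # non-English, rejected by both models
    (False, 0.90, True),   # dialectal English, missed by the supervised model
    (True, 0.60, True),    # short English message
]


def ensemble_is_english(langid_english: bool, demographic_score: float,
                        threshold: float = 0.8) -> bool:
    """Assumed rule: English if either model is confident (threshold is arbitrary)."""
    return langid_english or demographic_score >= threshold


def precision_recall(predictions: List[bool], gold: List[bool]) -> Tuple[float, float]:
    """Compute precision and recall for the positive (English) class."""
    tp = sum(1 for p, g in zip(predictions, gold) if p and g)
    fp = sum(1 for p, g in zip(predictions, gold) if p and not g)
    fn = sum(1 for p, g in zip(predictions, gold) if not p and g)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


gold = [g for _, _, g in TOY_DATA]
baseline_preds = [lid for lid, _, _ in TOY_DATA]
ensemble_preds = [ensemble_is_english(lid, score) for lid, score, _ in TOY_DATA]

baseline = precision_recall(baseline_preds, gold)
ensemble = precision_recall(ensemble_preds, gold)

print("supervised only (precision, recall):", baseline)
print("ensemble        (precision, recall):", ensemble)
```

On these invented records, the supervised identifier alone misses the two dialectal messages, while the OR rule recovers them without admitting the non-English one. In practice the threshold would need tuning on annotated data such as the dataset described above.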
