Twitter PoS VCB (Twitter part-of-speech vote-constrained-bootstrapping)

Introduced by Derczynski et al. in Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data

The dataset contains about 1.5 million English tweets annotated with part-of-speech tags from Ritter's extension of the PTB (Penn Treebank) tagset. The tweets date from 2012 and 2013, were tokenized with the GATE tokenizer, and were tagged jointly by the CMU ARK tagger and Ritter's T-POS tagger. A tweet is added to the dataset only when the two taggers' outputs are completely compatible over the whole tweet.
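The whole-tweet agreement filter can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `COMPATIBLE` table, the function names, and the input layout are all hypothetical, and the tag pairs shown are only a small illustrative subset of how a coarse tagset might be matched against PTB-style tags.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical compatibility table (an assumption for illustration only):
# each coarse tag maps to the fine-grained PTB-style tags it is taken to
# be compatible with.
COMPATIBLE: Dict[str, Set[str]] = {
    "N": {"NN", "NNS"},
    "V": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"},
    "D": {"DT"},
    "A": {"JJ", "JJR", "JJS"},
}

# One tweet: (tokens, coarse tags from tagger A, fine tags from tagger B).
Tweet = Tuple[Sequence[str], Sequence[str], Sequence[str]]


def tweet_agrees(coarse_tags: Sequence[str], fine_tags: Sequence[str]) -> bool:
    """Return True only if every token's two tags are compatible."""
    if len(coarse_tags) != len(fine_tags):
        return False
    return all(
        fine in COMPATIBLE.get(coarse, set())
        for coarse, fine in zip(coarse_tags, fine_tags)
    )


def bootstrap(tweets: Sequence[Tweet]) -> List[List[Tuple[str, str]]]:
    """Keep a tweet only when the taggers agree on all of its tokens.

    Kept tweets are returned as (token, fine tag) pairs; a single
    incompatible token causes the entire tweet to be discarded.
    """
    kept = []
    for tokens, coarse_tags, fine_tags in tweets:
        if len(tokens) == len(fine_tags) and tweet_agrees(coarse_tags, fine_tags):
            kept.append(list(zip(tokens, fine_tags)))
    return kept
```

Rejecting a tweet on any single disagreement trades coverage for precision: fewer tweets survive, but those that do carry labels both taggers support.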
