Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing

LREC 2014 · Noushin Rezapour Asheghi, Serge Sharoff, Katja Markert

Research in Natural Language Processing often relies on large collections of manually annotated documents. However, there is currently no reliable genre-annotated corpus of web pages for use in Automatic Genre Identification (AGI), in which documents are classified by genre rather than by topic or subject. The major shortcoming of available web genre collections is their relatively low inter-coder agreement, and the reliability of annotated data is essential to the reliability of research results. In this paper, we present the first reliably annotated web genre corpus. We developed precise and consistent annotation guidelines built on well-defined and well-recognized categories. To annotate the corpus we used crowd-sourcing, a novel approach in genre annotation. We computed chance-corrected inter-annotator agreement both overall and for each individual category. The results show that the corpus has been annotated reliably.
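
To make "chance-corrected inter-annotator agreement" concrete, the following is a minimal sketch of how an overall and a per-category coefficient might be computed. The abstract does not name the coefficient used, so Fleiss' kappa is an assumption here, and the function name `fleiss_kappa` and the genre labels in the usage lines are hypothetical, not taken from the corpus.

```python
from collections import Counter


def fleiss_kappa(labels_per_item):
    """Return (overall_kappa, per_category_kappa) for Fleiss' kappa.

    labels_per_item: list of lists; each inner list holds the labels that
    the raters assigned to one item. Every item must have the same number
    of ratings (n >= 2). This coefficient is an assumed choice; the
    abstract does not say which measure the authors used.
    """
    n_items = len(labels_per_item)
    n_raters = len(labels_per_item[0])
    if n_raters < 2 or any(len(l) != n_raters for l in labels_per_item):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(labels) for labels in labels_per_item]
    categories = sorted({c for labels in labels_per_item for c in labels})

    # Share of all ratings that fall into each category.
    p = {c: sum(cnt[c] for cnt in counts) / (n_items * n_raters)
         for c in categories}

    # Mean observed agreement over items.
    p_bar = sum(
        (sum(v * v for v in cnt.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items

    # Agreement expected by chance.
    p_e = sum(v * v for v in p.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    overall = (p_bar - p_e) / (1 - p_e)

    # Per-category kappa: treat each category as "this label vs. the rest".
    per_category = {}
    for c in categories:
        disagree = sum(cnt[c] * (n_raters - cnt[c]) for cnt in counts)
        denom = n_items * n_raters * (n_raters - 1) * p[c] * (1 - p[c])
        per_category[c] = 1 - disagree / denom

    return overall, per_category


# Hypothetical ratings: three pages, three annotators each.
ratings = [
    ["blog", "blog", "blog"],
    ["news", "news", "blog"],
    ["faq", "faq", "faq"],
]
overall, per_category = fleiss_kappa(ratings)
print(overall, per_category)
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance. Reporting the per-category figures alongside the overall one shows which genre labels annotators find hard to apply consistently.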
