CodE Alltag 2.0 --- A Pseudonymized German-Language Email Corpus
The vast amount of social communication distributed over various electronic media channels (tweets, blogs, emails, etc.), so-called user-generated content (UGC), creates entirely new opportunities for today{'}s NLP research. Yet, data privacy concerns implied by the unauthorized use of these text streams as a data resource are often neglected. In an attempt to reconciliate the diverging needs of unconstrained raw data use and preservation of data privacy in digital communication, we here investigate the automatic recognition of privacy-sensitive stretches of text in UGC and provide an algorithmic solution for the protection of personal data via pseudonymization. Our focus is directed at the de-identification of emails where personally identifying information does not only refer to the sender but also to those people, locations, dates, and other identifiers mentioned in greetings, boilerplates and the content-carrying body of emails. We evaluate several de-identification procedures and systems on two hitherto non-anonymized German-language email corpora (CodE AlltagS+d and CodE AlltagXL), and generate fully pseudonymized versions for both (CodE Alltag 2.0) in which personally identifying information of all social actors addressed in these mails has been camouflaged (to the greatest extent possible).
PDF Abstract