Unsupervised Discovery of Biographical Structure from Text
We present a method for discovering abstract event classes in biographies, based on a probabilistic latent-variable model. Taking as input timestamped text, we exploit latent correlations among events to learn a set of event classes (such as Born, Graduates High School, and Becomes Citizen), along with the typical times in a person's life when those events occur. In a quantitative evaluation on the task of predicting a person's age for a given event, we find that our generative model outperforms a strong linear regression baseline, along with simpler variants of the model that ablate some features. The abstract event classes that we learn allow us to perform a large-scale analysis of 242,970 Wikipedia biographies. Though it is known that women are greatly underrepresented on Wikipedia, not only as editors (Wikipedia, 2011) but also as subjects of articles (Reagle and Rhue, 2011), we find that there is a bias in their characterization as well, with biographies of women containing significantly more emphasis on events of marriage and divorce than biographies of men.