Biomedical Named Entity Recognition using Conditional Random Fields and Rich Feature Sets
As the volume of biomedical literature grows, there is a rising need for effective natural language processing tools to assist in organizing, curating, and retrieving this information. To that end, named entity recognition (the task of identifying words and phrases in free text that belong to certain classes of interest) is an important first step toward many of these larger information management goals. In recent years, much attention has been focused on the problem of recognizing gene and protein mentions in biomedical abstracts. This paper presents a framework for simultaneously recognizing occurrences of the PROTEIN, DNA, RNA, CELL-LINE, and CELL-TYPE entity classes using Conditional Random Fields with a variety of traditional and novel features. I show that this approach can achieve an overall F1 measure of around 70, which appears to be the current state of the art. The system described here was developed as part of the BioNLP/NLPBA 2004 shared task. Experiments were conducted on the training and evaluation sets provided by the task organizers.
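To make the idea of per-token features for a Conditional Random Field concrete, the following is a minimal sketch of the kind of orthographic and context features commonly fed to a CRF tagger. It is not the feature set used in this paper: the function names (`word_shape`, `token_features`, `sentence_features`) and the specific features are assumptions chosen for illustration only.

```python
import re


def word_shape(token):
    """Collapse a token to a coarse shape: uppercase -> A, lowercase -> a, digit -> 0."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    return shape


def token_features(tokens, i):
    """Build a feature dict for the token at position i (hypothetical feature set)."""
    tok = tokens[i]
    feats = {
        "word.lower": tok.lower(),
        "shape": word_shape(tok),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        "prefix3": tok[:3],
        "suffix3": tok[-3:],
    }
    # Neighbouring words give the tagger local context; sentence
    # boundaries are marked with placeholder symbols.
    feats["prev.lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


def sentence_features(tokens):
    """Return one feature dict per token, the usual input format for CRF toolkits."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A sequence of such feature dicts, paired with per-token labels for the five entity classes, is the typical training input for a linear-chain CRF. Shape features of this kind are useful in biomedical text because names such as "IL-2" mix capitals, hyphens, and digits.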