The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.
The corpus contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH terms “human”, “blood cells”, and “transcription factors”. The corpus has been annotated with various levels of linguistic and semantic information.
The primary categories of annotation in the GENIA corpus and the corresponding subcorpora are: