This is the dataset used for classifying Gene-Disease relationship types from sentences. The dataset consists of 3 files:
- manually_annotated_set.xlsx - set of 2000 manualy annotated sentences with entities
- Unbalanced_dataset.xlsx - set of 12000 sentences, out of which 2000 are from the first set, manually annotated, and the rest have been added using rule based method by adding sentences where extraction had confidence 1.
- Balanced_dataset_SUB_PRED.xlsx - balanced dataset generated by taking 2000 manually annotated sentences, but then adding sentences from the rule-based method with confidence 1 in such a way that each relationship class had at least 1400 sentences (for biomarkers, we could obtain 1243 sentences with confidence 1 from a processed portion of the data we had at the time of building the dataset).