The 1,073 full rare disease mention annotations (from 312 MIMIC-III discharge summaries) are in
The 198 rare disease mention annotations (from 145 MIMIC-III radiology reports) are in
test_set_RD_ann_MIMIC_III_rad.csv. Note that the radiology reports were used only for testing, not for validation.
Note: a row can be considered a true phenotype of the patient only when the value of the column gold mention-to-ORDO label is 1.
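
As a minimal sketch of the filtering rule above, the following keeps only the rows whose gold mention-to-ORDO label is 1. The column names are the ones listed in this README; the helper name `positive_rows` and the sample rows are hypothetical, and the code assumes the label is stored as 1/0 (possibly as "1.0").

```python
import csv
import io

LABEL_COL = "gold mention-to-ORDO label"  # column name as given in this README


def positive_rows(csv_file):
    """Yield rows whose gold mention-to-ORDO label equals 1."""
    for row in csv.DictReader(csv_file):
        value = (row.get(LABEL_COL) or "").strip()
        # Assumption: labels are numeric 1/0 (possibly "1.0"); blank cells are skipped.
        if value and float(value) == 1:
            yield row


# Made-up sample rows for illustration only; not taken from the dataset.
sample = io.StringIO(
    "mention,gold mention-to-ORDO label\n"
    "example mention A,1\n"
    "example mention B,0\n"
)
kept = list(positive_rows(sample))
print([r["mention"] for r in kept])
```

To run it on the released data, pass an open file handle instead of the in-memory sample, for example `open("test_set_RD_ann_MIMIC_III_rad.csv", newline="")`.
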
(i) Randomly sampled 500 discharge summaries (and 1000 radiology reports) from MIMIC-III
(ii) 312 of the 500 discharge summaries (and 145 of the 1000 radiology reports) have at least one positive UMLS mention linked to ORDO, as identified by SemEHR; in total there are 1,073 UMLS/ORDO mentions in the discharge summaries (and 198 in the radiology reports).
(iii) Three medical informatics researchers (staff or PhD students) annotated the 1,073 mentions (and two medical informatics researchers annotated the 198 mentions in radiology reports), judging whether each mention is a correct patient phenotype matched to UMLS and ORDO. Contradictions in the annotations were then resolved by another research staff member with a biomedical background.
|ROW_ID||Identifier unique to each row, see
|SUBJECT_ID||Identifier unique to a patient, see
|HADM_ID||Identifier unique to a patient hospital stay, see
|document structure name||The document structure name of the mention. The document structure name is identified by SemEHR (only for discharge summaries).|
|document structure offset in full document||The start and ending offsets of the document structure texts (or template) in the whole discharge summary. The document structure is parsed by SemEHR with regular expressions (only for discharge summaries).|
|mention||The mention identified by SemEHR.|
|mention offset in document structure||The start and ending offsets of the mention in the document structure (only for discharge summaries).|
|mention offset in full document||The start and ending offsets of the mention in the whole discharge summary. They can be calculated by
|UMLS with desc||The UMLS identified by SemEHR, corresponding to the mention.|
|ORDO with desc||The ORDO matched to the UMLS, using the linkage in the ORDO ontology, see
|gold mention-to-UMLS label||Whether the mention-UMLS pair indicates a correct phenotype of the patient (i.e. a positive mention that correctly matches the UMLS concept), 1 if correct, 0 if not.|
|gold UMLS-to-ORDO label||Whether the matching is correct from the UMLS concept to the ORDO concept, 1 if correct, 0 if not.|
|gold mention-to-ORDO label||Whether the mention-UMLS-ORDO triple indicates a correct phenotype of the patient, 1 if correct, 0 if not. This column is 1 if both the mention-to-UMLS label and the UMLS-to-ORDO label are 1, otherwise 0.|
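
The rule for the last column can be sketched as a small consistency check: the gold mention-to-ORDO label should equal the logical AND of the other two gold labels. The function names `derive_mention_to_ordo` and `inconsistent_rows` are hypothetical, the rows are made-up examples rather than dataset content, and the code assumes rows are dictionaries keyed by the column names in the table above with integer 0/1 labels.

```python
def derive_mention_to_ordo(mention_to_umls, umls_to_ordo):
    """Return 1 only when both component labels are 1, otherwise 0."""
    return 1 if mention_to_umls == 1 and umls_to_ordo == 1 else 0


def inconsistent_rows(rows):
    """Return indices of rows whose stored label differs from the derived one."""
    bad = []
    for i, row in enumerate(rows):
        expected = derive_mention_to_ordo(
            row["gold mention-to-UMLS label"],
            row["gold UMLS-to-ORDO label"],
        )
        if row["gold mention-to-ORDO label"] != expected:
            bad.append(i)
    return bad


# Made-up rows for illustration only; the third one is deliberately inconsistent.
rows = [
    {"gold mention-to-UMLS label": 1, "gold UMLS-to-ORDO label": 1,
     "gold mention-to-ORDO label": 1},
    {"gold mention-to-UMLS label": 1, "gold UMLS-to-ORDO label": 0,
     "gold mention-to-ORDO label": 0},
    {"gold mention-to-UMLS label": 0, "gold UMLS-to-ORDO label": 1,
     "gold mention-to-ORDO label": 1},
]
print(inconsistent_rows(rows))
```
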
* These manual annotations are by no means perfect. Some mentions are hypothetical, which made it difficult for the annotators to reach a decision (see the notes in the raw annotations). Also, the annotations are based on the output of
SemEHR, which does not have 100% recall, so they may not cover all rare disease mentions in the sampled discharge summaries.
* In row 323 from the full set or the validation set, the mention
nph is not in the document structure (due to an error in mention extraction), thus the
gold mention-to-UMLS label is
The two Excel workbooks,
for validation - SemEHR ori (MIMIC-III-DS, free text removed, with predictions).xlsx (annotations starting from column
CX and also in the third sheet,
distinct umls-ordo), and
for validation - 1000 docs - ori - MIMIC-III-rad (free text removed, with predictions).xlsx (annotations starting from column
show the raw annotations, including each annotator's results and notes, and the predictions of all baseline approaches/tools. The predictions were not available to the annotators when the annotations were made. The free text of the clinical notes was removed before the data were published.