Rare Diseases Mentions in MIMIC-III Dataset

## Data annotation

The full set of 1,073 rare disease mention annotations (from 312 MIMIC-III **discharge summaries**) is in [`full_set_RD_ann_MIMIC_III_disch.csv`](https://github.com/acadTags/Rare-disease-identification/blob/main/data%20annotation/full_set_RD_ann_MIMIC_III_disch.csv).

The data split:
* the first 400 rows are used for validation, [`validation_set_RD_ann_MIMIC_III_disch.csv`](https://github.com/acadTags/Rare-disease-identification/blob/main/data%20annotation/validation_set_RD_ann_MIMIC_III_disch.csv), and
* the last 673 rows are used for testing, [`test_set_RD_ann_MIMIC_III_disch.csv`](https://github.com/acadTags/Rare-disease-identification/blob/main/data%20annotation/test_set_RD_ann_MIMIC_III_disch.csv).
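
The positional split described above can be reproduced with a short sketch. It assumes the full-set CSV has a single header row and that "rows" means data rows in file order; `read_rows` and `split_full_set` are hypothetical helper names, not part of the repository, and the demo data are made up.

```python
import csv
import io


def read_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def split_full_set(rows, n_validation=400):
    """Split annotation rows into (validation, test) purely by position."""
    rows = list(rows)
    return rows[:n_validation], rows[n_validation:]


# Tiny in-memory stand-in for the real file (values are illustrative only).
demo_csv = io.StringIO("ROW_ID,mention\n1,a\n2,b\n3,c\n")
validation, test = split_full_set(read_rows(demo_csv), n_validation=2)
print(len(validation), len(test))  # 2 1
```

With the real file, opening `full_set_RD_ann_MIMIC_III_disch.csv` with `newline=""` and keeping the default `n_validation=400` should give 400 validation rows and 673 test rows, provided the file holds the 1,073 data rows stated above.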

The 198 rare disease mention annotations (from 145 MIMIC-III **radiology reports**) are in [`test_set_RD_ann_MIMIC_III_rad.csv`](https://github.com/acadTags/Rare-disease-identification/blob/main/data%20annotation/test_set_RD_ann_MIMIC_III_rad.csv). Note that the radiology reports were used only for testing, not for validation.

**Note**: a row can be considered a true phenotype of the patient only when the value of the column **gold mention-to-ORDO label** is 1.
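
As a minimal sketch of this rule, the filter below keeps only rows whose gold mention-to-ORDO label is 1. It assumes rows are dicts keyed by column name (for example from `csv.DictReader`) and that the CSV header matches the column name exactly; `true_phenotype_rows` is a hypothetical helper name.

```python
GOLD_COLUMN = "gold mention-to-ORDO label"


def true_phenotype_rows(rows, column=GOLD_COLUMN):
    """Keep only rows whose gold mention-to-ORDO label equals 1."""
    kept = []
    for row in rows:
        value = str(row.get(column, "")).strip()
        # Accept "1" and "1.0"; anything else (0, -1, blank, text) is excluded.
        try:
            is_positive = float(value) == 1.0
        except ValueError:
            is_positive = False
        if is_positive:
            kept.append(row)
    return kept


# Illustrative rows only; the mentions are made up.
demo_rows = [
    {"mention": "a", GOLD_COLUMN: "1"},
    {"mention": "b", GOLD_COLUMN: "0"},
    {"mention": "c", GOLD_COLUMN: ""},
]
print([row["mention"] for row in true_phenotype_rows(demo_rows)])  # ['a']
```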

## Data sampling and annotation procedure
* (i) Randomly sampled 500 discharge summaries (and 1000 radiology reports) from MIMIC-III

* (ii) 312 of the 500 discharge summaries (and 145 of the 1000 radiology reports) have at least one positive UMLS mention linked to ORDO, as identified by SemEHR; altogether there are 1,073 UMLS/ORDO mentions in the discharge summaries (and 198 in the radiology reports).

* (iii) 3 medical informatics researchers (staff or PhD students) annotated the 1,073 mentions (and 2 medical informatics researchers annotated the 198 mentions in radiology reports), judging whether each mention is a correct patient phenotype matched to UMLS and ORDO. Contradictions in the annotations were then resolved by another research staff member with a biomedical background.

## Data dictionary

| Column Name | Description |
|----------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ROW_ID                                       | Identifier unique to each row, see [`https://mimic.physionet.org/mimictables/noteevents/`](https://mimic.physionet.org/mimictables/noteevents/)                                                                                                                                                     |
| SUBJECT_ID                                | Identifier unique to a patient, see [`https://mimic.physionet.org/mimictables/noteevents/`](https://mimic.physionet.org/mimictables/noteevents/)                                                                                                                                                                                                              |
| HADM_ID                                      | Identifier unique to a patient hospital stay, see [`https://mimic.physionet.org/mimictables/noteevents/`](https://mimic.physionet.org/mimictables/noteevents/)                                                                                                                                                                                                              |
| document structure name | The document structure name of the mention. The document structure name is identified by SemEHR (only for discharge summaries). |
| document structure offset in full document | The start and ending offsets of the document structure texts (or template) in the whole discharge summary. The document structure is parsed by SemEHR with regular expressions (only for discharge summaries). |
| mention                                      | The mention identified by SemEHR.                                                                                                                                                                          |
| mention offset in document structure       | The start and ending offsets of the mention in the document structure (only for discharge summaries).                                                                                                                                      |
| mention offset in full document | The start and ending offsets of the mention in the whole discharge summary. They can be calculated from `document structure offset in full document` and `mention offset in document structure`. |
| UMLS with desc                               | The UMLS identified by SemEHR, corresponding to the mention.                                                                                                                                                |
| ORDO with desc                               | The ORDO matched to the UMLS, using the linkage in the ORDO ontology, see [`https://www.ebi.ac.uk/ols/ontologies/ordo/terms?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet_3325`](https://www.ebi.ac.uk/ols/ontologies/ordo/terms?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet_3325) as an example.          |
| gold mention-to-UMLS label | Whether the mention-UMLS pair indicates a correct phenotype of the patient (i.e. a positive mention that correctly matches the UMLS concept), 1 if correct, 0 if not. |
| gold UMLS-to-ORDO label                    | Whether the matching is correct from the UMLS concept to the ORDO concept, 1 if correct, 0 if not.                                                                                                          |
| gold mention-to-ORDO label                 | Whether the mention-ORDO triple indicates a correct phenotype of the patient, 1 if correct, 0 if not. This column is 1 if both the mention-to-UMLS label and the UMLS-to-ORDO label are 1, otherwise 0. |
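
Two of the columns above are derived from others, which the following sketch illustrates. It assumes offsets are available as `(start, end)` integer pairs and that mention offsets are counted from the start of the document structure; how the offsets are actually serialised in the CSV is not described here, so parsing is left out. Both function names are hypothetical.

```python
def mention_offset_in_full_document(structure_offset, mention_offset):
    """Shift a mention's (start, end) by the start of its document structure."""
    structure_start, _structure_end = structure_offset
    mention_start, mention_end = mention_offset
    return structure_start + mention_start, structure_start + mention_end


def mention_to_ordo_label(mention_to_umls, umls_to_ordo):
    """Return 1 only if both component labels are 1, otherwise 0."""
    return 1 if mention_to_umls == 1 and umls_to_ordo == 1 else 0


# Illustrative numbers only, not taken from the dataset.
print(mention_offset_in_full_document((120, 480), (15, 27)))  # (135, 147)
print(mention_to_ordo_label(1, 1), mention_to_ordo_label(1, 0))  # 1 0
```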

**Note:**
* These manual annotations are by no means perfect. Some hypothetical mentions were difficult for the annotators to decide on (see some notes in the raw annotations). Also, the annotations are based on the output of [`SemEHR`](https://github.com/CogStack/CogStack-SemEHR), which does not have 100% recall, so they may not cover all rare disease mentions in the sampled discharge summaries.
* In row 323 of the full set or the validation set, the mention `nph` is not in the document structure (due to an error in mention extraction), so the `gold mention-to-UMLS label` is `-1`.

## Raw annotations (with model predictions)
The two Excel workbooks,

* [`for validation - SemEHR ori (MIMIC-III-DS, free text removed, with predictions).xlsx`](https://github.com/acadTags/Rare-disease-identification/blob/main/data%20annotation/raw%20annotations%20(with%20model%20predictions)/for%20validation%20-%20SemEHR%20ori%20(MIMIC-III-DS%2C%20free%20text%20removed%2C%20with%20predictions).xlsx) (annotations starting from column `CX` and also in the third sheet, `distinct umls-ordo`), and

* [`for validation - 1000 docs - ori - MIMIC-III-rad (free text removed, with predictions).xlsx`](https://github.com/acadTags/Rare-disease-identification/blob/main/data%20annotation/raw%20annotations%20(with%20model%20predictions)/for%20validation%20-%201000%20docs%20-%20ori%20-%20MIMIC-III-rad%20(free%20text%20removed%2C%20with%20predictions).xlsx) (annotations starting from column `Z`),

show the raw annotations, including each annotator's results and notes, and the predictions of all baseline approaches/tools. The predictions were not available to the annotators when the annotations were made. The free text of the clinical notes was removed before the data were published.
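
When reading these workbooks programmatically, it can help to turn the column letters mentioned above (`CX`, `Z`) into numeric positions. The sketch below does only that conversion, using the standard base-26 letter scheme; `excel_column_to_index` is a hypothetical helper, and the choice of library for opening the workbooks is left open.

```python
def excel_column_to_index(letters):
    """Convert Excel column letters (e.g. "Z", "CX") to a 0-based index."""
    if not letters:
        raise ValueError("empty column reference")
    index = 0
    for char in letters.upper():
        if not "A" <= char <= "Z":
            raise ValueError(f"not a column letter: {char!r}")
        # Letters act as base-26 digits where A=1 ... Z=26.
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


print(excel_column_to_index("Z"), excel_column_to_index("CX"))  # 25 101
```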

Rare Diseases Mentions in MIMIC-III (Rare disease mention annotations from a sample of MIMIC-III clinical notes)
