SHADR (synthetic SDoH Human Annotated Demographic Robustness dataset)

Introduced by Guevara et al. in Large Language Models to Identify Social Determinants of Health in Electronic Health Records

SDoH Human Annotated Demographic Robustness (SHADR) Dataset

Overview

Social determinants of health (SDoH) play a pivotal role in determining patient outcomes. However, their documentation in electronic health records (EHRs) remains incomplete. This dataset was created as part of a study examining how well large language models extract SDoH from the free-text sections of EHRs. The study also explored whether synthetic clinical text can improve the extraction of these sparsely documented yet crucial clinical data.

Dataset Structure & Modification

To understand potential biases in high-performing models and in those pre-trained on general text, GPT-4 was used to insert demographic descriptors into our synthetic data.

For instance:

  • Original Sentence: "Widower admits fears surrounding potential judgment…"
  • Modified Sentence: "Hispanic widower admits fears surrounding potential judgment..."

Such demographic-infused sentences underwent manual validation. Out of these:

  • 419 had mentions of SDoH
  • 253 had mentions of adverse SDoH
  • The remainder were tagged as NO_SDoH

Instructions for Model Evaluation

  1. Initially, run your model inference on the original sentences.
  2. Subsequently, apply the same model to infer on the demographic-modified sentences.
  3. Compare the predictions on each original/modified sentence pair to assess robustness.
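
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' evaluation script: the function name `label_change_rate`, the `predict` callable, and the toy sentences are all hypothetical, and treating robustness as the fraction of pairs whose predicted label changes is an assumption about how the comparison might be done.

```python
from typing import Callable, Sequence


def label_change_rate(
    predict: Callable[[str], str],
    originals: Sequence[str],
    modified: Sequence[str],
) -> float:
    """Fraction of sentence pairs whose predicted label differs.

    `predict` is a hypothetical stand-in for your model: it maps one
    sentence to one label string. `originals[i]` and `modified[i]` are
    assumed to be the same sentence without and with a demographic
    descriptor.
    """
    if len(originals) != len(modified):
        raise ValueError("originals and modified must be aligned pairs")
    if not originals:
        raise ValueError("need at least one sentence pair")

    # Step 1: inference on the original sentences.
    original_preds = [predict(s) for s in originals]
    # Step 2: the same model on the demographic-modified sentences.
    modified_preds = [predict(s) for s in modified]
    # Step 3: count pairs where the prediction changed.
    changed = sum(o != m for o, m in zip(original_preds, modified_preds))
    return changed / len(originals)


def toy_predict(sentence: str) -> str:
    """Deliberately brittle toy model (illustration only)."""
    return "SDoH" if sentence.lower().startswith("widower") else "NO_SDoH"


# Made-up toy sentences, not taken from the dataset.
toy_originals = ["Widower lives alone.", "Patient reports no pain."]
toy_modified = ["Hispanic widower lives alone.", "Asian patient reports no pain."]

rate = label_change_rate(toy_predict, toy_originals, toy_modified)
print(f"Label change rate: {rate:.1%}")
```

A lower value means the model's predictions are less affected by the inserted demographic descriptors. In practice `predict` would wrap your own model, and the sentence pairs would come from the dataset files.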

For a detailed explanation of the "adverse" labeling, refer to https://arxiv.org/pdf/2308.06354.pdf. Here, the 'adverse' column indicates whether the label corresponds to an "adverse" or "non-adverse" SDoH.

Current Performance Metrics

  • Best Model Performance:
      • Any SDoH: 88% Macro-F1
      • Adverse SDoH: 84% Macro-F1

  • Robustness Rate:
      • Any SDoH: 9.9%
      • Adverse SDoH: 14.3%
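
For readers who want to compare their own results against the Macro-F1 figures above, the metric can be computed as sketched below. This is a plain-Python, single-label illustration under stated assumptions: the function name `macro_f1` is hypothetical, the label set is taken as the union of gold and predicted labels, and the exact class set and averaging used in the paper may differ.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores.

    Assumes one gold label and one predicted label per sentence.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")

    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class
        # has no true or predicted instances.
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy labels for illustration only, not results from the dataset.
gold = ["SDoH", "SDoH", "NO_SDoH", "NO_SDoH"]
pred = ["SDoH", "NO_SDoH", "NO_SDoH", "NO_SDoH"]
print(f"Macro-F1: {macro_f1(gold, pred):.3f}")
```

Because every class contributes equally to the average, rare classes affect Macro-F1 as much as frequent ones.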

How to Cite:

@misc{guevara2023large,
      title={Large Language Models to Identify Social Determinants of Health in Electronic Health Records}, 
      author={Marco Guevara and Shan Chen and Spencer Thomas and Tafadzwa L. Chaunzwa and Idalid Franco and Benjamin Kann and Shalini Moningi and Jack Qian and Madeleine Goldstein and Susan Harper and Hugo JWL Aerts and Guergana K. Savova and Raymond H. Mak and Danielle S. Bitterman},
      year={2023},
      eprint={2308.06354},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License


  • cc-by-4.0
