HiRID-ICU-Benchmark -- A Comprehensive Machine Learning Benchmark on High-resolution ICU Data

The recent success of machine learning methods applied to time series collected from Intensive Care Units (ICU) exposes the lack of standardized machine learning benchmarks for developing and comparing such methods. While raw datasets, such as MIMIC-IV or eICU, can be freely accessed on PhysioNet, tasks and pre-processing are often chosen ad hoc for each publication, which limits comparability across publications. In this work, we aim to improve this situation by providing a benchmark covering a large spectrum of ICU-related tasks. Using the HiRID dataset, we define multiple clinically relevant tasks in collaboration with clinicians. In addition, we provide a reproducible end-to-end pipeline to construct both data and labels. Finally, we provide an in-depth analysis of current state-of-the-art sequence modeling methods, highlighting some limitations of deep learning approaches for this type of data. With this benchmark, we hope to give the research community the possibility of a fair comparison of their work.

NeurIPS Datasets 2021

Results from the Paper


Task                      Dataset  Model                           Metric Name        Metric Value  Global Rank
Remaining Length of Stay  HiRID    LGBM                            MAE                56.9±0.4      # 1
Respiratory Failure       HiRID    LSTM                            AUPRC              0.569±0.003   # 7
Kidney Function           HiRID    LGBM                            MAE                0.45±0.00     # 1
Kidney Function           HiRID    TCN                             MAE                0.50±0.01     # 5
Kidney Function           HiRID    LSTM                            MAE                0.50±0.01     # 5
Kidney Function           HiRID    GRU                             MAE                0.49±0.02     # 4
Kidney Function           HiRID    Transformer                     MAE                0.48±0.02     # 3
Kidney Function           HiRID    LGBM (+ hand-crafted features)  MAE                0.45±0.00     # 1
Circulatory Failure       HiRID    Logistic Regression             AUPRC              0.305±0.000   # 6
Circulatory Failure       HiRID    LSTM                            AUPRC              0.322±0.008   # 7
Circulatory Failure       HiRID    TCN                             AUPRC              0.358±0.006   # 7
Circulatory Failure       HiRID    GRU                             AUPRC              0.368±0.005   # 4
Circulatory Failure       HiRID    Transformer                     AUPRC              0.352±0.006   # 5
Circulatory Failure       HiRID    LGBM                            AUPRC              0.389±0.003   # 2
Circulatory Failure       HiRID    LGBM (+ hand-crafted features)  AUPRC              0.388±0.002   # 3
Respiratory Failure       HiRID    Logistic Regression             AUPRC              0.530±0.000   # 8
Respiratory Failure       HiRID    GRU                             AUPRC              0.592±0.003   # 4
Respiratory Failure       HiRID    TCN                             AUPRC              0.589±0.003   # 5
Respiratory Failure       HiRID    Transformer                     AUPRC              0.594±0.003   # 3
Respiratory Failure       HiRID    LGBM                            AUPRC              0.585±0.001   # 6
Respiratory Failure       HiRID    LGBM (+ hand-crafted features)  AUPRC              0.604±0.002   # 1
Patient Phenotyping       HiRID    Logistic Regression             Balanced Accuracy  39.1±0.0      # 7
Patient Phenotyping       HiRID    GRU                             Balanced Accuracy  39.2±2.1      # 6
Patient Phenotyping       HiRID    LSTM                            Balanced Accuracy  39.5±1.2      # 5
Patient Phenotyping       HiRID    LGBM                            Balanced Accuracy  40.4±0.8      # 4
Patient Phenotyping       HiRID    TCN                             Balanced Accuracy  41.6±2.3      # 3
Patient Phenotyping       HiRID    Transformer                     Balanced Accuracy  42.7±1.4      # 2
Patient Phenotyping       HiRID    LGBM (+ hand-crafted features)  Balanced Accuracy  45.8±2.0      # 1
ICU Mortality             HiRID    LGBM                            AUPRC              0.546±0.008   # 7
ICU Mortality             HiRID    Logistic Regression             AUPRC              0.581±0.000   # 6
ICU Mortality             HiRID    LSTM                            AUPRC              0.600±0.009   # 5
ICU Mortality             HiRID    TCN                             AUPRC              0.602±0.011   # 4
ICU Mortality             HiRID    GRU                             AUPRC              0.603±0.016   # 3
ICU Mortality             HiRID    Transformer                     AUPRC              0.610±0.008   # 2
ICU Mortality             HiRID    LGBM (+ hand-crafted features)  AUPRC              0.626±0.000   # 1
Remaining Length of Stay  HiRID    LSTM                            MAE                60.7±1.6      # 6
Remaining Length of Stay  HiRID    GRU                             MAE                60.6±0.9      # 5
Remaining Length of Stay  HiRID    TCN                             MAE                59.8±2.8      # 4
Remaining Length of Stay  HiRID    Transformer                     MAE                59.5±2.8      # 3
Remaining Length of Stay  HiRID    LGBM (+ hand-crafted features)  MAE                57.0±0.3      # 2
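
The results above use three metric types (MAE, AUPRC, balanced accuracy), each reported as a value with a ± spread. As a rough illustration of what these numbers measure, the dependency-free Python sketch below computes each metric and formats a mean±standard-deviation summary. It is not the benchmark's evaluation code. All function names are hypothetical, AUPRC is approximated by average precision (one common estimator), and reading ± as a sample standard deviation across repeated runs is an assumption that this listing does not confirm. The numbers in the demo are made up.

```python
from statistics import mean, stdev


def mean_absolute_error(y_true, y_pred):
    """MAE for regression tasks: average absolute difference."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy for multi-class tasks: mean of per-class recall."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def average_precision(y_true, scores):
    """Average precision, a step-wise estimate of the area under the
    precision-recall curve, for binary labels in {0, 1}.

    Assumes no tied scores; ties are broken by input order here.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


def summarize(values, digits=3):
    """Format repeated-run scores as mean±sample standard deviation."""
    return f"{mean(values):.{digits}f}±{stdev(values):.{digits}f}"


if __name__ == "__main__":
    # Made-up labels and scores, for illustration only.
    print(average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(balanced_accuracy([0, 0, 0, 1], [0, 0, 1, 1]))
    print(mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
    # Made-up per-run AUPRC values from three hypothetical seeds.
    print(summarize([0.60, 0.61, 0.59]))
```

For MAE, lower is better; for AUPRC and balanced accuracy, higher is better, which is the direction the rank column should be read in for each task.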
