In this study, we investigate the level of implicit race information available to ML models and human experts and the implications of model-detectable differences in clinical notes.
Across two different blackbox model architectures and four popular explainability methods, we find that the approximation quality of explanation models, also known as the fidelity, differs significantly between subgroups.
We also find that methods which achieve group fairness do so by worsening performance for all groups.
Deep metric learning (DML) enables learning with less supervision through its emphasis on the similarity structure of representations.
Finally, we apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention and demonstrate similar results.
Our results show that our method can fit simple predictive checklists that perform well and that can easily be customized to obey a rich class of custom constraints.
This framework allows us to compare across datasets, saying that, apart from a set of ``shortcut features'', classifying each sample in the Multi-NLI task involves around 0. 4 nats more TSI than in the Quora Question Pair.
Machine learning has successfully framed many sequential decision making problems as either supervised prediction, or optimal decision-making policy identification via reinforcement learning.
Stein variational gradient descent (SVGD) is a deterministic inference algorithm that evolves a set of particles to fit a target distribution.
Since naive importance sampling with the marginal density as a proposal requires exponential sample complexity in the true mutual information, we propose novel Multi-Sample Annealed Importance Sampling (AIS) bounds on mutual information.
This systematic investigation underlines the importance of accounting for the underlying data-generating mechanisms and fortifying data-preprocessing pipelines with a causal framework to develop methods robust to confounding biases.
Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality.
no code implementations • 21 Jul 2021 • Imon Banerjee, Ananth Reddy Bhimireddy, John L. Burns, Leo Anthony Celi, Li-Ching Chen, Ramon Correa, Natalie Dullerud, Marzyeh Ghassemi, Shih-Cheng Huang, Po-Chih Kuo, Matthew P Lungren, Lyle Palmer, Brandon J Price, Saptarshi Purkayastha, Ayis Pyrros, Luke Oakden-Rayner, Chima Okechukwu, Laleh Seyyed-Kalantari, Hari Trivedi, Ryan Wang, Zachary Zaiman, Haoran Zhang, Judy W Gichoya
Methods: Using private and public datasets we evaluate: A) performance quantification of deep learning models to detect race from medical images, including the ability of these models to generalize to external environments and across multiple imaging modalities, B) assessment of possible confounding anatomic and phenotype population features, such as disease distribution and body habitus as predictors of race, and C) investigation into the underlying mechanism by which AI models can recognize race.
Finally, we propose few-shot DML as an efficient way to consistently improve generalization in response to unknown test shifts presented in ooDML.
In this work, we benchmark the performance of eight domain generalization methods on multi-site clinical time series and medical imaging data.
Reinforcement Learning (RL) has recently been applied to sequential estimation and prediction problems identifying and developing hypothetical treatment strategies for septic patients, with a particular focus on offline learning with observational data.
Reliable treatment effect estimation from observational data depends on the availability of all confounding information.
In ablations on DBDC4 data from 2019, our semi-supervised learning methods improve the performance of a baseline BERT model by 2\% accuracy.
Next, we present MSBC, a classifier that applies MS-BERT to generate embeddings and predict EDSS and functional subscores.
Our results highlight lesser-known limitations of methods for DP learning in health care, models that exhibit steep tradeoffs between privacy and utility, and models whose predictions are disproportionately influenced by large demographic groups in the training data.
The use of machine learning (ML) in health care raises numerous ethical concerns, especially as models can amplify existing health inequities.
Deep Metric Learning (DML) provides a crucial tool for visual similarity and zero-shot applications by learning generalizing embedding spaces, although recent work in DML has shown strong performance saturation across training objectives.
Ranked #6 on Metric Learning on CARS196 (using extra training data)
Multi-task learning (MTL) is a machine learning technique aiming to improve model performance by leveraging information across many tasks.
Deep Neural Networks have shown great promise on a variety of downstream applications; but their ability to adapt and generalize to new data and tasks remains a challenge.
CheXpert is very useful, but is relatively computationally slow, especially when integrated with end-to-end neural pipelines, is non-differentiable so can't be used in any applications that require gradients to flow through the labeler, and does not yield probabilistic outputs, which limits our ability to improve the quality of the silver labeler through techniques such as active learning.
This dataset currently contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource to develop and evaluate tools to aid in the treatment of COVID-19.
Domain shift, encountered when using a trained model for a new patient population, creates significant challenges for sequential decision making in healthcare since the target domain may be both data-scarce and confounded.
6 code implementations • 24 May 2020 • Joseph Paul Cohen, Lan Dao, Paul Morrison, Karsten Roth, Yoshua Bengio, Beiyi Shen, Almas Abbasi, Mahsa Hoshmand-Kochi, Marzyeh Ghassemi, Haifang Li, Tim Q Duong
In this study, we present a severity score prediction model for COVID-19 pneumonia for frontal chest X-ray images.
In this work, we examine the extent to which embeddings may encode marginalized populations differently, and how this may lead to a perpetuation of biases and worsened performance on clinical tasks.
We demonstrate that TPR disparities exist in the state-of-the-art classifiers in all datasets, for all clinical tasks, and all subgroups.
Ranked #1 on Multi-Label Classification on ChestX-ray14
We learn mappings from other languages to English and detect aphasia from linguistic characteristics of speech, and show that OT domain adaptation improves aphasia detection over unilingual baselines for French (6% increased F1) and Mandarin (5% increased F1).
Particle-based inference algorithm is a promising method to efficiently generate samples for an intractable target distribution by iteratively updating a set of particles.
When training clinical prediction models from electronic health records (EHRs), a key concern should be a model's ability to sustain performance over time when deployed, even as care practices, database systems, and population demographics evolve.
Robust machine learning relies on access to data that can be used with standardized frameworks in important tasks and the ability to develop models whose performance can be reasonably reproduced.
Ranked #3 on Length-of-Stay prediction on MIMIC-III
Machine learning algorithms designed to characterize, monitor, and intervene on human health (ML4H) are expected to perform safely and reliably when operating at scale, potentially outside strict human supervision.
Understanding if classifiers generalize to out-of-sample datasets is a central problem in machine learning.
The automatic generation of radiology reports given medical radiographs has significant potential to operationally and improve clinical patient care.
Machine learning for healthcare often trains models on de-identified datasets with randomly-shifted calendar dates, ignoring the fact that data were generated under hospital operation practices that change over time.
We analyze the impact of age of the added samples and if they affect fairness in classification.
no code implementations • 17 Nov 2018 • Natalia Antropova, Andrew L. Beam, Brett K. Beaulieu-Jones, Irene Chen, Corey Chivers, Adrian Dalca, Sam Finlayson, Madalina Fiterau, Jason Alan Fries, Marzyeh Ghassemi, Mike Hughes, Bruno Jedynak, Jasvinder S. Kandola, Matthew McDermott, Tristan Naumann, Peter Schulam, Farah Shamout, Alexandre Yahi
This volume represents the accepted submissions from the Machine Learning for Health (ML4H) workshop at the conference on Neural Information Processing Systems (NeurIPS) 2018, held on December 8, 2018 in Montreal, Canada.
Making decisions about what clinical tasks to prepare for is multi-factored, and especially challenging in intensive care environments where resources must be balanced with patient needs.
There are established racial disparities in healthcare, including during end-of-life care, when poor communication and trust can lead to suboptimal outcomes for patients and their families.
Modern electronic health records (EHRs) provide data to answer clinically meaningful questions.
Risk prediction is central to both clinical medicine and public health.
Sepsis is a leading cause of mortality in intensive care units and costs hospitals billions annually.
In this work, we propose a new approach to deduce optimal treatment policies for septic patients by using continuous state-space models and deep reinforcement learning.
Real-time prediction of clinical interventions remains a challenge within intensive care units (ICUs).
We use autoencoders to create low-dimensional embeddings of underlying patient phenotypes that we hypothesize are a governing factor in determining how different patients will react to different interventions.
Voice disorders affect an estimated 14 million working-aged Americans, and many more worldwide.