no code implementations • 11 Aug 2023 • Karim Lekadir, Aasa Feragen, Abdul Joseph Fofanah, Alejandro F Frangi, Alena Buyx, Anais Emelie, Andrea Lara, Antonio R Porras, An-Wen Chan, Arcadi Navarro, Ben Glocker, Benard O Botwe, Bishesh Khanal, Brigit Beger, Carol C Wu, Celia Cintas, Curtis P Langlotz, Daniel Rueckert, Deogratias Mzurikwao, Dimitrios I Fotiadis, Doszhan Zhussupov, Enzo Ferrante, Erik Meijering, Eva Weicken, Fabio A González, Folkert W Asselbergs, Fred Prior, Gabriel P Krestin, Gary Collins, Geletaw S Tegenaw, Georgios Kaissis, Gianluca Misuraca, Gianna Tsakou, Girish Dwivedi, Haridimos Kondylakis, Harsha Jayakody, Henry C Woodruf, Hugo JWL Aerts, Ian Walsh, Ioanna Chouvarda, Irène Buvat, Islem Rekik, James Duncan, Jayashree Kalpathy-Cramer, Jihad Zahir, Jinah Park, John Mongan, Judy W Gichoya, Julia A Schnabel, Kaisar Kushibar, Katrine Riklund, Kensaku MORI, Kostas Marias, Lameck M Amugongo, Lauren A Fromont, Lena Maier-Hein, Leonor Cerdá Alberich, Leticia Rittner, Lighton Phiri, Linda Marrakchi-Kacem, Lluís Donoso-Bach, Luis Martí-Bonmatí, M Jorge Cardoso, Maciej Bobowicz, Mahsa Shabani, Manolis Tsiknakis, Maria A Zuluaga, Maria Bielikova, Marie-Christine Fritzsche, Marius George Linguraru, Markus Wenzel, Marleen de Bruijne, Martin G Tolsgaard, Marzyeh Ghassemi, Md Ashrafuzzaman, Melanie Goisauf, Mohammad Yaqub, Mohammed Ammar, Mónica Cano Abadía, Mukhtar M E Mahmoud, Mustafa Elattar, Nicola Rieke, Nikolaos Papanikolaou, Noussair Lazrak, Oliver Díaz, Olivier Salvado, Oriol Pujol, Ousmane Sall, Pamela Guevara, Peter Gordebeke, Philippe Lambin, Pieta Brown, Purang Abolmaesumi, Qi Dou, Qinghua Lu, Richard Osuala, Rose Nakasi, S Kevin Zhou, Sandy Napel, Sara Colantonio, Shadi Albarqouni, Smriti Joshi, Stacy Carter, Stefan Klein, Steffen E Petersen, Susanna Aussó, Suyash Awate, Tammy Riklin Raviv, Tessa Cook, Tinashe E M Mutsvangwa, Wendy A Rogers, Wiro J Niessen, Xènia Puig-Bosch, Yi Zeng, Yunusa G Mohammed, Yves Saint James Aquino, Zohaib Salahuddin, Martijn P A Starmans
This work describes the FUTURE-AI guideline as the first international consensus framework for guiding the development and deployment of trustworthy AI tools in healthcare.
Additionally, the supervised DML model that uses ECGs with access to 8,172 mPCWP labels demonstrated significantly better performance on the mPCWP regression task compared to the supervised baseline.
In this paper, we focus on the models' visual perception alignment with humans, further referred to as AI-human visual alignment.
More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously.
Social determinants of health (SDOH) -- the conditions in which people live, grow, and age -- play a crucial role in a person's health and well-being.
Data sharing is crucial for open science and reproducible research, but the legal sharing of clinical data requires the removal of protected health information from electronic health records.
no code implementations • 23 Mar 2023 • Edward H. Lee, Brendan Kelly, Emre Altinmakas, Hakan Dogan, Maryam Mohammadzadeh, Errol Colak, Steve Fu, Olivia Choudhury, Ujjwal Ratan, Felipe Kitamura, Hernan Chaves, Jimmy Zheng, Mourad Said, Eduardo Reis, Jaekwang Lim, Patricia Yokoo, Courtney Mitchell, Golnaz Houshmand, Marzyeh Ghassemi, Ronan Killeen, Wendy Qiu, Joel Hayden, Farnaz Rafiee, Chad Klochko, Nicholas Bevins, Faeze Sazgara, S. Simon Wong, Michael Moseley, Safwan Halabi, Kristen W. Yeom
While it is well known that population differences from genetics, sex, race, and environmental factors contribute to disease, AI studies in medicine have largely focused on locoregional patient cohorts with less diverse data sources.
Since accurate estimation of MI without density information requires a sample size exponential in the true MI, we assume either a single marginal or the full joint density information is known.
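As a toy illustration of the known-density setting, one can form a plug-in Monte Carlo estimate of MI for a bivariate Gaussian, where the true value is -0.5 ln(1 - rho^2); this sketch is illustrative only and is not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

# log densities are known in closed form for this Gaussian example
inv_cov = np.linalg.inv(cov)
log_joint = (-0.5 * np.einsum('ij,jk,ik->i', xy, inv_cov, xy)
             - np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(cov)))
log_marg = -0.5 * (xy ** 2).sum(axis=1) - np.log(2 * np.pi)  # two N(0, 1) marginals

mi_est = float(np.mean(log_joint - log_marg))   # plug-in Monte Carlo estimate
mi_true = -0.5 * np.log(1 - rho ** 2)           # analytic MI for this Gaussian
```

With both densities available, the estimate concentrates around the true MI at the usual Monte Carlo rate, in contrast to the exponential sample complexity of the density-free case.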
We find that DistDeD significantly improves over prior discovery approaches, providing indications of the risk 10 hours earlier on average as well as increasing detection by 20%.
We propose GRACE, a lifelong model editing method, which implements spot-fixes on streaming errors of a deployed model, ensuring minimal impact on unrelated inputs.
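A minimal sketch of the spot-fix idea (class and method names are hypothetical, not the paper's implementation): keep a small cache of edit keys with corrected outputs, and override the base model only for inputs close to a cached key:

```python
class EditCache:
    """Toy spot-fix wrapper: override the base model only for inputs that
    fall within eps of a stored edit key; all other inputs pass through."""

    def __init__(self, base_model, eps=0.5):
        self.base_model = base_model
        self.eps = eps
        self.edits = []                     # list of (key_vector, corrected_output)

    def add_edit(self, key, corrected_output):
        self.edits.append((key, corrected_output))

    def predict(self, x):
        for key, out in self.edits:
            dist = sum((a - b) ** 2 for a, b in zip(x, key)) ** 0.5
            if dist <= self.eps:
                return out                  # the spot-fix fires
        return self.base_model(x)           # unrelated inputs are untouched
```

For example, after `add_edit([0.0, 0.0], "positive")`, only inputs within `eps` of that key return the correction, which captures the "minimal impact on unrelated inputs" property in miniature.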
In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms.
Influence functions efficiently estimate the effect of removing a single training data point on a model's learned parameters.
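As a quick sanity check of that idea, the classic influence-function approximation theta_{-i} ≈ theta + H^{-1} grad_i can be compared against exact leave-one-out retraining for ridge regression; this is a generic toy setup, not any specific paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 1e-2                                  # small ridge penalty

def fit(X, y):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta = fit(X, y)
H = X.T @ X + lam * np.eye(3)               # Hessian of the total loss
i = 7                                       # point whose removal we estimate
grad_i = X[i] * (X[i] @ theta - y[i])       # gradient of point i's loss at theta
influence = np.linalg.solve(H, grad_i)      # approx. theta_{-i} - theta

theta_loo = fit(np.delete(X, i, axis=0), np.delete(y, i))  # exact retrain
```

The single Hessian solve avoids retraining once per data point, which is what makes the estimate efficient.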
Developing and deploying machine learning models safely depends on the ability to characterize and compare their abilities to generalize to new environments.
In this work, we show that models personalized with group attributes can reduce performance at the group level.
In this study, we investigate the level of implicit race information available to ML models and human experts and the implications of model-detectable differences in clinical notes.
Across two different blackbox model architectures and four popular explainability methods, we find that the approximation quality of explanation models, also known as the fidelity, differs significantly between subgroups.
Deep metric learning (DML) enables learning with less supervision through its emphasis on the similarity structure of representations.
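The similarity structure that DML optimizes is typically expressed through losses such as the triplet loss; a minimal sketch (illustrative only, not a full DML pipeline):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

Minimizing this pulls same-class embeddings together and pushes different-class embeddings at least `margin` apart, so only relative similarity labels are needed rather than dense supervision.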
We also find that methods which achieve group fairness do so by worsening performance for all groups.
Finally, we apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention and demonstrate similar results.
Our results show that our method can fit simple predictive checklists that perform well and that can easily be customized to obey a rich class of custom constraints.
This framework allows us to compare across datasets, saying that, apart from a set of "shortcut features", classifying each sample in the Multi-NLI task involves around 0.4 nats more TSI than in the Quora Question Pairs task.
Machine learning has successfully framed many sequential decision making problems as either supervised prediction, or optimal decision-making policy identification via reinforcement learning.
Stein variational gradient descent (SVGD) is a deterministic inference algorithm that evolves a set of particles to fit a target distribution.
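A minimal 1-D SVGD sketch with an RBF kernel and a Gaussian target N(3, 1); the fixed bandwidth and step size are simplifying assumptions (practical implementations often use a median-heuristic bandwidth):

```python
import numpy as np

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=100)    # initial particle set

def svgd_step(x, step=0.1, h=1.0):
    """One SVGD update toward the target N(3, 1) with an RBF kernel."""
    score = -(x - 3.0)                        # grad log p(x) for N(3, 1)
    diff = x[:, None] - x[None, :]            # diff[j, i] = x_j - x_i
    k = np.exp(-diff ** 2 / (2 * h ** 2))     # kernel matrix k(x_j, x_i)
    grad_k = -diff / h ** 2 * k               # gradient of k in its first argument
    phi = (k.T @ score + grad_k.sum(axis=0)) / len(x)
    return x + step * phi                     # attraction to density + mutual repulsion

for _ in range(500):
    particles = svgd_step(particles)
```

The first term of `phi` drives particles toward high-density regions; the kernel-gradient term repels them from each other, which is what makes the update deterministic yet spread out like samples.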
This systematic investigation underlines the importance of accounting for the underlying data-generating mechanisms and fortifying data-preprocessing pipelines with a causal framework to develop methods robust to confounding biases.
Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality.
no code implementations • 21 Jul 2021 • Imon Banerjee, Ananth Reddy Bhimireddy, John L. Burns, Leo Anthony Celi, Li-Ching Chen, Ramon Correa, Natalie Dullerud, Marzyeh Ghassemi, Shih-Cheng Huang, Po-Chih Kuo, Matthew P Lungren, Lyle Palmer, Brandon J Price, Saptarshi Purkayastha, Ayis Pyrros, Luke Oakden-Rayner, Chima Okechukwu, Laleh Seyyed-Kalantari, Hari Trivedi, Ryan Wang, Zachary Zaiman, Haoran Zhang, Judy W Gichoya
Methods: Using private and public datasets we evaluate: A) performance quantification of deep learning models to detect race from medical images, including the ability of these models to generalize to external environments and across multiple imaging modalities, B) assessment of possible confounding anatomic and phenotype population features, such as disease distribution and body habitus as predictors of race, and C) investigation into the underlying mechanism by which AI models can recognize race.
Finally, we propose few-shot DML as an efficient way to consistently improve generalization in response to unknown test shifts presented in ooDML.
In this work, we benchmark the performance of eight domain generalization methods on multi-site clinical time series and medical imaging data.
Reinforcement Learning (RL) has recently been applied to sequential estimation and prediction problems identifying and developing hypothetical treatment strategies for septic patients, with a particular focus on offline learning with observational data.
Reliable treatment effect estimation from observational data depends on the availability of all confounding information.
In ablations on DBDC4 data from 2019, our semi-supervised learning methods improve the performance of a baseline BERT model by 2% accuracy.
Next, we present MSBC, a classifier that applies MS-BERT to generate embeddings and predict EDSS and functional subscores.
Our results highlight lesser-known limitations of methods for DP learning in health care, models that exhibit steep tradeoffs between privacy and utility, and models whose predictions are disproportionately influenced by large demographic groups in the training data.
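For context, DP learning in this setting typically means DP-SGD-style training; one update step (per-example gradient clipping, then calibrated Gaussian noise) can be sketched as follows (parameter names are illustrative):

```python
import numpy as np

def dp_sgd_update(theta, per_example_grads, lr=0.1, clip=1.0,
                  noise_mult=1.1, rng=None):
    """One DP-SGD step: clip each example's gradient to norm <= clip,
    average, then add Gaussian noise before the parameter update."""
    if rng is None:
        rng = np.random.default_rng(0)
    clipped = [g * min(1.0, clip / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    mean_g = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip / len(per_example_grads),
                       size=mean_g.shape)
    return theta - lr * (mean_g + noise)
```

The clipping bounds any single patient's influence on the update, and the noise scale relative to that bound is what determines the privacy guarantee; it is also the source of the privacy-utility tradeoffs the abstract describes.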
The use of machine learning (ML) in health care raises numerous ethical concerns, especially as models can amplify existing health inequities.
Deep Metric Learning (DML) provides a crucial tool for visual similarity and zero-shot applications by learning generalizing embedding spaces, although recent work in DML has shown strong performance saturation across training objectives.
Ranked #9 on Metric Learning on CARS196 (using extra training data)
Multi-task learning (MTL) is a machine learning technique aiming to improve model performance by leveraging information across many tasks.
Deep Neural Networks have shown great promise on a variety of downstream applications, but their ability to adapt and generalize to new data and tasks remains a challenge.
CheXpert is very useful, but it is relatively slow computationally, especially when integrated with end-to-end neural pipelines; it is non-differentiable, so it cannot be used in applications that require gradients to flow through the labeler; and it does not yield probabilistic outputs, which limits our ability to improve the quality of the silver labeler through techniques such as active learning.
This dataset currently contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource to develop and evaluate tools to aid in the treatment of COVID-19.
Domain shift, encountered when using a trained model for a new patient population, creates significant challenges for sequential decision making in healthcare since the target domain may be both data-scarce and confounded.
6 code implementations • 24 May 2020 • Joseph Paul Cohen, Lan Dao, Paul Morrison, Karsten Roth, Yoshua Bengio, Beiyi Shen, Almas Abbasi, Mahsa Hoshmand-Kochi, Marzyeh Ghassemi, Haifang Li, Tim Q Duong
In this study, we present a severity score prediction model for COVID-19 pneumonia for frontal chest X-ray images.
In this work, we examine the extent to which embeddings may encode marginalized populations differently, and how this may lead to a perpetuation of biases and worsened performance on clinical tasks.
We demonstrate that TPR disparities exist in the state-of-the-art classifiers in all datasets, for all clinical tasks, and all subgroups.
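An audit of this kind reduces to computing the true-positive rate separately per subgroup and inspecting the gaps; a minimal sketch:

```python
def tpr_by_group(y_true, y_pred, groups):
    """True-positive rate for each subgroup; gaps between groups
    indicate a TPR disparity for the classifier being audited."""
    rates = {}
    for g in set(groups):
        tp = sum(1 for t, p, gg in zip(y_true, y_pred, groups)
                 if gg == g and t == 1 and p == 1)
        pos = sum(1 for t, gg in zip(y_true, groups) if gg == g and t == 1)
        rates[g] = tp / pos if pos else float("nan")
    return rates
```

In the chest X-ray setting, `groups` would hold attributes such as sex or insurance type, and a large TPR gap means one subgroup's positive cases are systematically missed.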
Ranked #1 on Multi-Label Classification on ChestX-ray14
We learn mappings from other languages to English and detect aphasia from linguistic characteristics of speech, and show that OT domain adaptation improves aphasia detection over unilingual baselines for French (6% increased F1) and Mandarin (5% increased F1).
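Entropy-regularized optimal transport, solved with Sinkhorn iterations, is a common building block for such OT domain adaptation; a minimal sketch on a toy two-point problem (not the paper's exact method):

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.5, n_iter=200):
    """Entropy-regularized OT plan between histograms a and b with cost C."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]      # transport plan with marginals ~a, ~b

# toy example: transport two source points onto two target points
src = np.array([0.0, 1.0])
tgt = np.array([2.0, 3.0])
C = (src[:, None] - tgt[None, :]) ** 2      # squared-distance cost matrix
P = sinkhorn(np.full(2, 0.5), np.full(2, 0.5), C)
```

The plan `P` concentrates mass on the cheap pairings, and its barycentric projection gives the mapping from source-language features to the target (English) feature space.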
Particle-based inference algorithms are a promising way to efficiently generate samples from an intractable target distribution by iteratively updating a set of particles.
When training clinical prediction models from electronic health records (EHRs), a key concern should be a model's ability to sustain performance over time when deployed, even as care practices, database systems, and population demographics evolve.
Robust machine learning relies on access to data that can be used with standardized frameworks in important tasks and the ability to develop models whose performance can be reasonably reproduced.
Ranked #3 on Length-of-Stay prediction on MIMIC-III
Machine learning algorithms designed to characterize, monitor, and intervene on human health (ML4H) are expected to perform safely and reliably when operating at scale, potentially outside strict human supervision.
Understanding if classifiers generalize to out-of-sample datasets is a central problem in machine learning.
The automatic generation of radiology reports given medical radiographs has significant potential to operationally and clinically improve patient care.
Machine learning for healthcare often trains models on de-identified datasets with randomly-shifted calendar dates, ignoring the fact that data were generated under hospital operation practices that change over time.
We analyze the impact of the age of the added samples and whether they affect fairness in classification.
no code implementations • 17 Nov 2018 • Natalia Antropova, Andrew L. Beam, Brett K. Beaulieu-Jones, Irene Chen, Corey Chivers, Adrian Dalca, Sam Finlayson, Madalina Fiterau, Jason Alan Fries, Marzyeh Ghassemi, Mike Hughes, Bruno Jedynak, Jasvinder S. Kandola, Matthew McDermott, Tristan Naumann, Peter Schulam, Farah Shamout, Alexandre Yahi
This volume represents the accepted submissions from the Machine Learning for Health (ML4H) workshop at the conference on Neural Information Processing Systems (NeurIPS) 2018, held on December 8, 2018 in Montreal, Canada.
Deciding which clinical tasks to prepare for involves many factors, and is especially challenging in intensive care environments, where resources must be balanced with patient needs.
There are established racial disparities in healthcare, including during end-of-life care, when poor communication and trust can lead to suboptimal outcomes for patients and their families.
In this work, we characterize the doctor-patient relationship using a machine learning-derived trust score.
Modern electronic health records (EHRs) provide data to answer clinically meaningful questions.
Risk prediction is central to both clinical medicine and public health.
Sepsis is a leading cause of mortality in intensive care units and costs hospitals billions annually.
In this work, we propose a new approach to deduce optimal treatment policies for septic patients by using continuous state-space models and deep reinforcement learning.
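As a toy illustration of value-based offline RL in a continuous state space (greatly simplified relative to the paper's setup; the dynamics, rewards, and features below are invented for the example), fitted Q-iteration with one linear Q-head per action might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def step(s, a):
    # toy dynamics: treating (a=1) lowers severity, waiting lets it drift up
    s2 = min(max(s - 0.2 if a == 1 else s + 0.1, 0.0), 1.0)
    return s2, -s2                          # reward = negative severity

# offline batch of (s, a, r, s') transitions, as in observational data
data = {0: [], 1: []}
for _ in range(2000):
    s, a = rng.uniform(0, 1), int(rng.integers(0, 2))
    s2, r = step(s, a)
    data[a].append((s, r, s2))

gamma, W = 0.9, np.zeros((2, 2))            # one linear Q head per action
for _ in range(50):                         # fitted Q-iteration sweeps
    for a in (0, 1):
        S, R, S2 = map(np.array, zip(*data[a]))
        Phi = np.column_stack([np.ones_like(S), S])
        Phi2 = np.column_stack([np.ones_like(S2), S2])
        y = R + gamma * np.maximum(Phi2 @ W[0], Phi2 @ W[1])
        W[a] = np.linalg.lstsq(Phi, y, rcond=None)[0]

policy = lambda s: int(np.array([1.0, s]) @ W[1] > np.array([1.0, s]) @ W[0])
```

Each sweep regresses Bellman targets onto state features, so the policy is learned purely from the logged batch, with no online interaction with patients.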
Real-time prediction of clinical interventions remains a challenge within intensive care units (ICUs).
We use autoencoders to create low-dimensional embeddings of underlying patient phenotypes that we hypothesize are a governing factor in determining how different patients will react to different interventions.
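A minimal linear-autoencoder sketch of this idea (toy data standing in for patient features; not the paper's architecture): compress correlated inputs to a low-dimensional embedding and train by minimizing reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy "patients": three correlated measurements driven by one latent phenotype
z = rng.normal(size=(200, 1))
X = z @ np.array([[1.0, 0.5, -0.3]]) + 0.05 * rng.normal(size=(200, 3))

W_enc = rng.normal(scale=0.1, size=(3, 1))  # encoder: 3 features -> 1-d embedding
W_dec = rng.normal(scale=0.1, size=(1, 3))  # decoder: embedding -> reconstruction
lr = 0.01
for _ in range(1500):
    H = X @ W_enc                           # low-dimensional patient embedding
    err = H @ W_dec - X                     # reconstruction error
    grad_dec = H.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = float(((X @ W_enc @ W_dec - X) ** 2).mean())
```

Because the data are driven by a single latent factor, the one-dimensional embedding suffices to reconstruct the inputs; in the clinical setting, the analogous embedding is the hypothesized phenotype governing intervention response.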
Voice disorders affect an estimated 14 million working-aged Americans, and many more worldwide.