Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development

Therapeutics machine learning is an emerging field with incredible opportunities for innovation and impact. However, advancement in this field requires formulation of meaningful learning tasks and careful curation of datasets. Here, we introduce Therapeutics Data Commons (TDC), the first unifying platform to systematically access and evaluate machine learning across the entire range of therapeutics. To date, TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and types of meaningful data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards. All resources are integrated and accessible via an open Python library. We carry out extensive experiments on selected datasets, demonstrating that even the strongest algorithms fall short of solving key therapeutics challenges, including real dataset distributional shifts, multi-scale modeling of heterogeneous data, and robust generalization to novel data points. We envision that TDC can facilitate algorithmic and scientific advances and considerably accelerate machine-learning model development, validation and transition into biomedical and clinical implementation. TDC is an open-science initiative available at https://tdcommons.ai.
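The abstract's emphasis on meaningful data splits and generalization to novel data points can be illustrated with a small sketch. The code below is not the TDC library's API; `group_split` and the toy records are hypothetical names introduced here. It assumes each record carries a group label (for molecules this could be a scaffold identifier) and keeps every group entirely inside one subset, so the test set contains only groups the model never saw during training.

```python
import random
from collections import defaultdict


def group_split(records, group_key, frac=(0.7, 0.1, 0.2), seed=0):
    """Split records into train/valid/test so that no group spans two subsets.

    `records` is a list of dicts and `group_key` names the field that defines
    a group. Hypothetical helper for illustration; not part of the TDC library.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    # Shuffle the group order deterministically, then fill train, valid, test in turn.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n = len(records)
    train_cap = frac[0] * n
    valid_cap = (frac[0] + frac[1]) * n
    split = {"train": [], "valid": [], "test": []}
    seen = 0
    for key in keys:
        if seen < train_cap:
            target = "train"
        elif seen < valid_cap:
            target = "valid"
        else:
            target = "test"
        # A whole group always goes to a single subset.
        split[target].extend(groups[key])
        seen += len(groups[key])
    return split


# Toy data: 20 records across 10 made-up groups.
toy = [{"id": i, "group": f"g{i % 10}", "y": i % 2} for i in range(20)]
parts = group_split(toy, "group")
```

Because whole groups are assigned at once, the realized subset sizes only approximate the requested fractions; that is the usual trade-off of group-aware splitting against a plain random split.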

Task: Molecular Property Prediction

| Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|
| BBBP | AttrMasking | ROC-AUC | 89.2 | # 6 |
| BBBP | AttentiveFP | ROC-AUC | 85.5 | # 8 |

Task: TDC ADMET Benchmarking Group (Dataset: tdcommons). Each cell gives Metric Value (Global Rank) for the model named in the column header.

| Metric Name | MLP-RDKit2D | GCN | AttentiveFP | AttrMasking |
|---|---|---|---|---|
| TDC.Solubility_AqSolDB | 0.827 (# 3) | 0.907 (# 4) | 0.776 (# 2) | 1.026 (# 5) |
| TDC.Caco2_Wang | 0.393 (# 4) | 0.599 (# 7) | 0.401 (# 5) | 0.546 (# 6) |
| TDC.HIA_Hou | 0.972 (# 4) | 0.936 (# 5) | 0.974 (# 3) | 0.978 (# 2) |
| TDC.Pgp_Broccatelli | 0.918 (# 2) | 0.895 (# 4) | 0.892 (# 5) | 0.929 (# 1) |
| TDC.Bioavailability_Ma | 0.672 (# 2) | 0.566 (# 5) | 0.632 (# 3) | 0.577 (# 4) |
| TDC.Lipophilicity_AstraZeneca | 0.574 (# 5) | 0.541 (# 2) | 0.572 (# 4) | 0.547 (# 3) |
| TDC.BBB_Martins | 0.889 (# 3) | 0.842 (# 5) | 0.855 (# 4) | 0.892 (# 2) |
| TDC.PPBR_AZ | 9.994 (# 3) | 10.194 (# 5) | 9.373 (# 2) | 10.075 (# 4) |
| TDC.VDss_Lombardo | 0.561 (# 2) | 0.457 (# 4) | 0.241 (# 5) | 0.559 (# 3) |
| TDC.CYP2D6_Inhibition_Veith | 0.616 (# 4) | 0.616 (# 4) | 0.646 (# 3) | 0.721 (# 2) |
| TDC.CYP3A4_Inhibition_Veith | 0.829 (# 4) | 0.840 (# 3) | 0.851 (# 2) | 0.902 (# 1) |
| TDC.CYP2C9_Inhibition_Veith | 0.742 (# 4) | 0.735 (# 5) | 0.749 (# 3) | 0.829 (# 2) |
| TDC.CYP2D6_Substrate_CarbonMangels | 0.677 (# 2) | 0.617 (# 3) | 0.574 (# 4) | 0.704 (# 1) |
| TDC.CYP3A4_Substrate_CarbonMangels | 0.639 (# 2) | 0.590 (# 3) | 0.576 (# 5) | 0.582 (# 4) |
| TDC.CYP2C9_Substrate_CarbonMangels | 0.360 (# 4) | 0.344 (# 5) | 0.375 (# 3) | 0.381 (# 2) |
| TDC.Half_Life_Obach | 0.184 (# 3) | 0.239 (# 2) | 0.085 (# 5) | 0.151 (# 4) |
| TDC.Clearance_Microsome_AZ | 0.586 (# 1) | 0.532 (# 3) | 0.365 (# 5) | 0.585 (# 2) |
| TDC.Clearance_Hepatocyte_AZ | 0.382 (# 3) | 0.366 (# 4) | 0.289 (# 5) | 0.413 (# 2) |
| TDC.hERG | 0.841 (# 1) | 0.738 (# 5) | 0.825 (# 2) | 0.778 (# 4) |
| TDC.AMES | 0.823 (# 3) | 0.818 (# 4) | 0.814 (# 5) | 0.842 (# 2) |
| TDC.DILI | 0.875 (# 4) | 0.859 (# 5) | 0.886 (# 3) | 0.919 (# 2) |
| TDC.LD50_Zhu | 0.678 (# 3) | 0.649 (# 2) | 0.678 (# 3) | 0.685 (# 5) |
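
For readers unfamiliar with the ROC-AUC metric reported for the BBBP results above, the following is a minimal sketch of one standard way to compute it: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the toy inputs are illustrative assumptions; this is not the evaluation code used by the leaderboard.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    Uses a simple pairwise loop, which is fine for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, not taken from any TDC dataset.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.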
