CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning
Advances in graph machine learning (ML) have been driven by applications in chemistry, as graphs remain the most expressive representations of molecules. While early graph ML methods focused primarily on small organic molecules, the scope of graph ML has recently expanded to include inorganic materials. Modelling the periodicity and symmetry of inorganic crystalline materials poses unique challenges, which existing graph ML methods are unable to address. Moving to inorganic nanomaterials adds further complexity, as the number of nodes in each graph can span a broad range ($10$ to $10^5$). The bulk of existing graph ML focuses on characterising molecules and materials by predicting target properties with graphs as input. However, the most exciting applications of graph ML will be in its generative capabilities, which are currently not on par with those in other domains such as images or text. We invite the graph ML community to address these open challenges by presenting two new chemically-informed large-scale inorganic (CHILI) nanomaterials datasets: a medium-scale dataset (>6M nodes and >49M edges overall) of mono-metallic oxide nanomaterials generated from 12 selected crystal types (CHILI-3K), and a large-scale dataset (>183M nodes and >1.2B edges overall) of nanomaterials generated from experimentally determined crystal structures (CHILI-100K). We define 11 property prediction tasks and 6 structure prediction tasks, which are of special interest for nanomaterial research. We benchmark the performance of a wide array of baseline methods and use these benchmarking results to highlight areas that need future work. To the best of our knowledge, CHILI-3K and CHILI-100K are the first open-source nanomaterial datasets of this scale (both at the level of individual graphs and of the dataset as a whole) and the only nanomaterials datasets with high structural and elemental diversity.
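As a rough illustration of what representing a nanoparticle as a graph can look like, the sketch below treats atoms as nodes and connects pairs of atoms that lie closer than a distance cutoff. The helper name `radius_graph`, the cutoff value, and the toy cubic cluster are assumptions made for this example; they are not the construction procedure used for CHILI-3K or CHILI-100K.

```python
import numpy as np

def radius_graph(positions, cutoff):
    """Connect every pair of distinct atoms closer than `cutoff`.

    Hypothetical helper for illustration only. Returns a (2, E) array of
    directed edges and the matching (E,) array of edge lengths.
    """
    positions = np.asarray(positions, dtype=float)
    # Dense pairwise distance matrix: O(N^2) memory, fine for toy inputs only.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Keep close pairs, excluding self-loops on the diagonal.
    mask = (dist < cutoff) & ~np.eye(len(positions), dtype=bool)
    src, dst = np.nonzero(mask)
    return np.stack([src, dst]), dist[src, dst]

# Toy cluster (assumed data): 8 atoms on the corners of a unit cube.
grid = np.arange(2)
positions = np.array(
    [[x, y, z] for x in grid for y in grid for z in grid], dtype=float
)
atomic_numbers = np.full(len(positions), 8)  # node feature, e.g. oxygen

edge_index, edge_length = radius_graph(positions, cutoff=1.1)
print(edge_index.shape, edge_length.min(), edge_length.max())
```

The dense distance matrix grows quadratically with the number of atoms, so a graph near the upper end of the $10^5$-node range would need a spatial neighbour search (for example a k-d tree or cell list) instead.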
Datasets
Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Benchmark |
---|---|---|---|---|---|---|
X-ray PDF regression | CHILI-100K | Mean | MSE | 0.007 | # 1 | |
X-ray PDF regression | CHILI-100K | EdgeCNN | MSE | 0.012 +/- 0.000 | # 2 | |
X-ray PDF regression | CHILI-100K | GIN | MSE | 0.013 +/- 0.000 | # 3 | |
X-ray PDF regression | CHILI-100K | GraphUNet | MSE | 0.013 +/- 0.000 | # 3 | |
X-ray PDF regression | CHILI-100K | GAT | MSE | 0.013 +/- 0.000 | # 3 | |
X-ray PDF regression | CHILI-100K | GraphSAGE | MSE | 0.037 +/- 0.026 | # 8 | |
X-ray PDF regression | CHILI-100K | PMLP | MSE | 0.013 +/- 0.000 | # 3 | |
X-ray PDF regression | CHILI-100K | GCN | MSE | 0.014 +/- 0.000 | # 7 | |
XRD regression | CHILI-100K | EdgeCNN | MSE | 0.006 +/- 0.000 | # 1 | |
XRD regression | CHILI-100K | GIN | MSE | 0.009 +/- 0.000 | # 3 | |
XRD regression | CHILI-100K | GraphUNet | MSE | 0.009 +/- 0.000 | # 3 | |
XRD regression | CHILI-100K | GAT | MSE | 0.108 +/- 0.172 | # 8 | |
XRD regression | CHILI-100K | GraphSAGE | MSE | 0.018 +/- 0.014 | # 6 | |
XRD regression | CHILI-100K | PMLP | MSE | 0.008 +/- 0.001 | # 2 | |
XRD regression | CHILI-100K | GCN | MSE | 0.009 +/- 0.000 | # 3 | |
XRD regression | CHILI-100K | Mean | MSE | 0.021 | # 7 | |
SAXS regression | CHILI-100K | EdgeCNN | MSE | 0.007 +/- 0.009 | # 2 | |
SAXS regression | CHILI-100K | GIN | MSE | 0.009 +/- 0.000 | # 3 | |
SAXS regression | CHILI-100K | GraphUNet | MSE | 0.009 +/- 0.000 | # 3 | |
SAXS regression | CHILI-100K | GAT | MSE | 0.009 +/- 0.000 | # 3 | |
SAXS regression | CHILI-100K | GraphSAGE | MSE | 0.011 +/- 0.002 | # 7 | |
SAXS regression | CHILI-100K | PMLP | MSE | 0.003 +/- 0.000 | # 1 | |
SAXS regression | CHILI-100K | GCN | MSE | 0.010 +/- 0.000 | # 6 | |
SAXS regression | CHILI-100K | Mean | MSE | 0.038 | # 8 | |
Distance regression | CHILI-100K | EdgeCNN | MSE | 0.030 +/- 0.001 | # 1 | |
Distance regression | CHILI-100K | GIN | MSE | 0.491 +/- 0.038 | # 8 | |
Distance regression | CHILI-100K | GraphUNet | MSE | 0.085 +/- 0.002 | # 3 | |
Distance regression | CHILI-100K | GAT | MSE | 0.252 +/- 0.003 | # 5 | |
Distance regression | CHILI-100K | GraphSAGE | MSE | 0.064 +/- 0.001 | # 2 | |
Distance regression | CHILI-100K | PMLP | MSE | 0.486 +/- 0.014 | # 7 | |
Distance regression | CHILI-100K | GCN | MSE | 0.090 +/- 0.002 | # 4 | |
Distance regression | CHILI-100K | Mean | MSE | 0.307 | # 6 | |
Position regression | CHILI-100K | EdgeCNN | Positional MAE | 16.336 +/- 0.000 | # 2 | |
Position regression | CHILI-100K | GIN | Positional MAE | 16.336 +/- 0.000 | # 2 | |
Position regression | CHILI-100K | GraphUNet | Positional MAE | 14.824 +/- 0.315 | # 1 | |
Position regression | CHILI-100K | GAT | Positional MAE | 16.336 +/- 0.000 | # 2 | |
Position regression | CHILI-100K | GraphSAGE | Positional MAE | 16.337 +/- 0.000 | # 8 | |
Position regression | CHILI-100K | PMLP | Positional MAE | 16.336 +/- 0.000 | # 2 | |
Position regression | CHILI-100K | GCN | Positional MAE | 16.336 +/- 0.000 | # 2 | |
Position regression | CHILI-100K | Mean | Positional MAE | 16.336 | # 2 | |
Space group classification | CHILI-100K | EdgeCNN | F1-score (Weighted) | 0.158 +/- 0.035 | # 1 | |
Space group classification | CHILI-100K | GIN | F1-score (Weighted) | 0.043 +/- 0.000 | # 5 | |
Space group classification | CHILI-100K | GraphUNet | F1-score (Weighted) | 0.043 +/- 0.000 | # 5 | |
Space group classification | CHILI-100K | GAT | F1-score (Weighted) | 0.044 +/- 0.001 | # 3 | |
Space group classification | CHILI-100K | GraphSAGE | F1-score (Weighted) | 0.044 +/- 0.002 | # 3 | |
Space group classification | CHILI-100K | PMLP | F1-score (Weighted) | 0.047 +/- 0.012 | # 2 | |
Space group classification | CHILI-100K | GCN | F1-score (Weighted) | 0.043 +/- 0.001 | # 5 | |
Space group classification | CHILI-100K | Most Frequent Class | F1-score (Weighted) | 0.010 | # 8 | |
Space group classification | CHILI-100K | Random | F1-score (Weighted) | 0.002 +/- 0.001 | # 9 | |
Crystal system classification | CHILI-100K | EdgeCNN | F1-score (Weighted) | 0.072 +/- 0.047 | # 4 | |
Crystal system classification | CHILI-100K | GIN | F1-score (Weighted) | 0.069 +/- 0.040 | # 5 | |
Crystal system classification | CHILI-100K | GraphUNet | F1-score (Weighted) | 0.068 +/- 0.006 | # 7 | |
Crystal system classification | CHILI-100K | GAT | F1-score (Weighted) | 0.110 +/- 0.029 | # 3 | |
Crystal system classification | CHILI-100K | GraphSAGE | F1-score (Weighted) | 0.061 +/- 0.019 | # 8 | |
Crystal system classification | CHILI-100K | PMLP | F1-score (Weighted) | 0.124 +/- 0.036 | # 2 | |
Crystal system classification | CHILI-100K | GCN | F1-score (Weighted) | 0.069 +/- 0.023 | # 5 | |
Crystal system classification | CHILI-100K | Most Frequent Class | F1-score (Weighted) | 0.046 | # 9 | |
Crystal system classification | CHILI-100K | Random | F1-score (Weighted) | 0.168 +/- 0.014 | # 1 | |
Atomic number classification | CHILI-100K | EdgeCNN | F1-score (Weighted) | 0.572 +/- 0.017 | # 1 | |
Atomic number classification | CHILI-100K | GIN | F1-score (Weighted) | 0.336 +/- 0.005 | # 2 | |
Atomic number classification | CHILI-100K | GraphUNet | F1-score (Weighted) | 0.287 +/- 0.004 | # 3 | |
Atomic number classification | CHILI-100K | GAT | F1-score (Weighted) | 0.192 +/- 0.000 | # 6 | |
Atomic number classification | CHILI-100K | GraphSAGE | F1-score (Weighted) | 0.195 +/- 0.007 | # 5 | |
Atomic number classification | CHILI-100K | PMLP | F1-score (Weighted) | 0.191 +/- 0.000 | # 8 | |
Atomic number classification | CHILI-100K | GCN | F1-score (Weighted) | 0.275 +/- 0.002 | # 4 | |
Atomic number classification | CHILI-100K | Most Frequent Class | F1-score (Weighted) | 0.192 | # 6 | |
Atomic number classification | CHILI-100K | Random | F1-score (Weighted) | 0.015 +/- 0.000 | # 9 | |
X-ray PDF regression | CHILI-3K | EdgeCNN | MSE | 0.011 +/- 0.000 | # 2 | |
X-ray PDF regression | CHILI-3K | GraphUNet | MSE | 0.012 +/- 0.000 | # 3 | |
X-ray PDF regression | CHILI-3K | GAT | MSE | 0.029 +/- 0.030 | # 7 | |
X-ray PDF regression | CHILI-3K | GraphSAGE | MSE | 0.012 +/- 0.000 | # 3 | |
X-ray PDF regression | CHILI-3K | PMLP | MSE | 0.012 +/- 0.000 | # 3 | |
X-ray PDF regression | CHILI-3K | GCN | MSE | 0.012 +/- 0.000 | # 3 | |
X-ray PDF regression | CHILI-3K | Mean | MSE | 0.008 | # 1 | |
XRD regression | CHILI-3K | EdgeCNN | MSE | 0.008 +/- 0.001 | # 1 | |
XRD regression | CHILI-3K | GraphUNet | MSE | 0.010 +/- 0.000 | # 2 | |
XRD regression | CHILI-3K | GAT | MSE | 0.010 +/- 0.000 | # 2 | |
XRD regression | CHILI-3K | GraphSAGE | MSE | 0.010 +/- 0.000 | # 2 | |
XRD regression | CHILI-3K | PMLP | MSE | 0.010 +/- 0.000 | # 2 | |
XRD regression | CHILI-3K | GCN | MSE | 0.010 +/- 0.000 | # 2 | |
XRD regression | CHILI-3K | Mean | MSE | 0.017 | # 7 | |
SAXS regression | CHILI-3K | EdgeCNN | MSE | 0.006 +/- 0.004 | # 1 | |
SAXS regression | CHILI-3K | GIN | MSE | 0.008 +/- 0.000 | # 2 | |
SAXS regression | CHILI-3K | GraphUNet | MSE | 0.008 +/- 0.000 | # 2 | |
SAXS regression | CHILI-3K | GAT | MSE | 0.008 +/- 0.000 | # 2 | |
SAXS regression | CHILI-3K | GraphSAGE | MSE | 0.008 +/- 0.001 | # 2 | |
SAXS regression | CHILI-3K | PMLP | MSE | 0.022 +/- 0.025 | # 7 | |
SAXS regression | CHILI-3K | GCN | MSE | 0.008 +/- 0.000 | # 2 | |
SAXS regression | CHILI-3K | Mean | MSE | 0.037 | # 8 | |
Distance regression | CHILI-3K | EdgeCNN | MSE | 0.015 +/- 0.001 | # 1 | |
Distance regression | CHILI-3K | GIN | MSE | 0.464 +/- 0.005 | # 8 | |
Distance regression | CHILI-3K | GraphUNet | MSE | 0.055 +/- 0.001 | # 2 | |
Distance regression | CHILI-3K | GAT | MSE | 0.342 +/- 0.117 | # 6 | |
Distance regression | CHILI-3K | GraphSAGE | MSE | 0.055 +/- 0.002 | # 2 | |
Distance regression | CHILI-3K | PMLP | MSE | 0.359 +/- 0.017 | # 7 | |
Distance regression | CHILI-3K | GCN | MSE | 0.056 +/- 0.006 | # 4 | |
Distance regression | CHILI-3K | Mean | MSE | 0.265 | # 5 | |
Position regression | CHILI-3K | EdgeCNN | Positional MAE | 16.575 +/- 0.000 | # 2 | |
Position regression | CHILI-3K | GIN | Positional MAE | 16.575 +/- 0.000 | # 2 | |
Position regression | CHILI-3K | GraphUNet | Positional MAE | 14.765 +/- 0.395 | # 1 | |
Position regression | CHILI-3K | GAT | Positional MAE | 16.575 +/- 0.000 | # 2 | |
Position regression | CHILI-3K | GraphSAGE | Positional MAE | 16.575 +/- 0.000 | # 2 | |
Position regression | CHILI-3K | PMLP | Positional MAE | 16.575 +/- 0.000 | # 2 | |
Position regression | CHILI-3K | GCN | Positional MAE | 16.575 +/- 0.000 | # 2 | |
Position regression | CHILI-3K | Mean | Positional MAE | 16.575 | # 2 | |
Space group classification | CHILI-3K | EdgeCNN | F1-score (Weighted) | 0.733 +/- 0.207 | # 1 | |
Space group classification | CHILI-3K | GIN | F1-score (Weighted) | 0.125 +/- 0.026 | # 4 | |
Space group classification | CHILI-3K | GraphUNet | F1-score (Weighted) | 0.095 +/- 0.036 | # 8 | |
Space group classification | CHILI-3K | GAT | F1-score (Weighted) | 0.113 +/- 0.013 | # 5 | |
Space group classification | CHILI-3K | GraphSAGE | F1-score (Weighted) | 0.151 +/- 0.045 | # 2 | |
Space group classification | CHILI-3K | PMLP | F1-score (Weighted) | 0.135 +/- 0.006 | # 3 | |
Space group classification | CHILI-3K | GCN | F1-score (Weighted) | 0.099 +/- 0.019 | # 7 | |
Space group classification | CHILI-3K | Most Frequent Class | F1-score (Weighted) | 0.108 | # 6 | |
Space group classification | CHILI-3K | Random | F1-score (Weighted) | 0.009 +/- 0.008 | # 9 | |
Crystal system classification | CHILI-3K | EdgeCNN | F1-score (Weighted) | 0.657 +/- 0.196 | # 1 | |
Crystal system classification | CHILI-3K | GIN | F1-score (Weighted) | 0.438 +/- 0.004 | # 5 | |
Crystal system classification | CHILI-3K | GraphUNet | F1-score (Weighted) | 0.431 +/- 0.014 | # 6 | |
Crystal system classification | CHILI-3K | GAT | F1-score (Weighted) | 0.504 +/- 0.076 | # 2 | |
Crystal system classification | CHILI-3K | GraphSAGE | F1-score (Weighted) | 0.422 +/- 0.037 | # 7 | |
Crystal system classification | CHILI-3K | PMLP | F1-score (Weighted) | 0.440 +/- 0.036 | # 3 | |
Crystal system classification | CHILI-3K | GCN | F1-score (Weighted) | 0.367 +/- 0.127 | # 8 | |
Crystal system classification | CHILI-3K | Most Frequent Class | F1-score (Weighted) | 0.440 | # 3 | |
Crystal system classification | CHILI-3K | Random | F1-score (Weighted) | 0.191 +/- 0.008 | # 9 | |
Atomic number classification | CHILI-3K | EdgeCNN | F1-score (Weighted) | 0.632 +/- 0.009 | # 1 | |
Atomic number classification | CHILI-3K | GIN | F1-score (Weighted) | 0.587 +/- 0.002 | # 2 | |
Atomic number classification | CHILI-3K | GraphUNet | F1-score (Weighted) | 0.552 +/- 0.079 | # 3 | |
Atomic number classification | CHILI-3K | GAT | F1-score (Weighted) | 0.461 +/- 0.000 | # 6 | |
Atomic number classification | CHILI-3K | GraphSAGE | F1-score (Weighted) | 0.491 +/- 0.004 | # 5 | |
Atomic number classification | CHILI-3K | PMLP | F1-score (Weighted) | 0.461 +/- 0.000 | # 6 | |
Atomic number classification | CHILI-3K | GCN | F1-score (Weighted) | 0.496 +/- 0.001 | # 4 | |
Atomic number classification | CHILI-3K | Most Frequent Class | F1-score (Weighted) | 0.461 | # 6 | |
Atomic number classification | CHILI-3K | Random | F1-score (Weighted) | 0.016 +/- 0.000 | # 9 |
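The table reports MSE for the regression tasks and weighted F1-score for the classification tasks, alongside simple reference rows (Mean, Most Frequent Class). The sketch below shows one plausible way such reference scores can be computed. It assumes the Mean row predicts the training-set mean of the target and the Most Frequent Class row predicts the most common training label; the function names and toy arrays are hypothetical and are not taken from the CHILI benchmark code.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over all elements."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * np.sum(y_true == c)  # weight by number of true samples
    return float(total / len(y_true))

# Toy regression data (assumed): a constant predictor set to the training mean.
train_targets = np.array([1.0, 2.0, 3.0])
test_targets = np.array([2.0, 4.0])
mean_pred = np.full_like(test_targets, train_targets.mean())
mean_baseline_mse = mse(test_targets, mean_pred)

# Toy classification data (assumed): always predict the most common training label.
train_labels = np.array([0, 0, 0, 1, 2])
test_labels = np.array([0, 0, 1, 2])
most_frequent = np.bincount(train_labels).argmax()
mfc_pred = np.full_like(test_labels, most_frequent)
mfc_baseline_f1 = weighted_f1(test_labels, mfc_pred)

print(mean_baseline_mse, mfc_baseline_f1)
```

Because the weighted F1-score gives every never-predicted class a score of zero, a constant predictor can score far below its accuracy when classes are imbalanced, which is worth keeping in mind when comparing learned models against the Most Frequent Class rows.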