site/src/routes/preprint/+page.md (+34 -18)
@@ -142,17 +142,6 @@ Only current knowledge is accessible to a real discovery campaign and our metric

### Models

-{#if mounted}
-<MetricsTable />
-{/if}
-
-> @label:fig:metrics-table Classification and regression metrics for all models tested on our benchmark ranked by F1 score.
-> The heat map ranges from yellow (best) to blue (worst) performance.
-> DAF = discovery acceleration factor (see text), TPR = true positive rate, TNR = false negative rate, MAE = mean absolute error, RMSE = root mean squared error.
-> The dummy classifier uses the 'scikit-learn' 'stratified' strategy of randomly assigning stable/unstable labels according to the training set prevalence.
-> The dummy regression metrics MAE, RMSE and $R^2$ are attained by always predicting the test set mean.
-> The Voronoi RF, CGCNN and MEGNet models are seen to be worse than the dummy result on regression metrics but better on some of the classification metrics, highlighting the importance of looking at the right metrics for the task at hand to gauge model performance.
-
To test a wide variety of methodologies proposed for learning the potential energy landscape, our initial benchmark release includes 10 models.

1. **CHGNet**[@deng_chgnet_2023] (UIP-GNN) - CHGNet is a UIP for charge-informed atomistic modeling.
@@ -199,7 +188,34 @@ To test a wide variety of methodologies proposed for learning the potential ener

## Results

-shows performance metrics for all models included in the initial release of Matbench Discovery.
+{#if mounted}
+<MetricsTable />
+{/if}
+
+> @label:fig:metrics-table Classification and regression metrics for all models tested on our benchmark ranked by F1 score.
+> The heat map ranges from yellow (best) to blue (worst) performance.
+> DAF = discovery acceleration factor (see text), TPR = true positive rate, TNR = true negative rate, MAE = mean absolute error, RMSE = root mean squared error.
+> The dummy classifier uses the 'scikit-learn' 'stratified' strategy of randomly assigning stable/unstable labels according to the training set prevalence.
+> The dummy regression metrics MAE, RMSE and $R^2$ are attained by always predicting the test set mean.
+> The Voronoi RF, CGCNN and MEGNet models are seen to be worse than the dummy result on regression metrics but better on some of the classification metrics, highlighting the importance of looking at the right metrics for the task at hand to gauge model performance.
+>
+> <details>
+> <summary>Table glossary</summary>
+>
+> - DAF = discovery acceleration factor
+> - TPR = true positive rate, the fraction of stable structures correctly predicted as stable
+> - TNR = true negative rate, the fraction of unstable structures correctly predicted as unstable
+> - MAE = mean absolute error
+> - RMSE = root mean squared error
+> - GNN = graph neural network
+> - UIP = universal interatomic potential
+> - BO = Bayesian optimization
+> - RF = random forest
+> - +P = training data augmentation using random structure perturbations
+>
+> </details>
+
+@Fig:metrics-table shows performance metrics for all models included in the initial release of Matbench Discovery.
CHGNet takes the top spot on all metrics except true positive rate (TPR) and emerges as the current SOTA for ML-guided materials discovery.
The discovery acceleration factor (DAF) measures how many more stable structures a model found compared to the dummy discovery rate of 43k / 257k $\approx$ 16.7% achieved by randomly selecting test set crystals.
The maximum possible DAF is the inverse of the dummy discovery rate, which on our dataset is ~6.
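
To make the DAF concrete: it equals the model's hit rate among the materials it predicts stable (its precision) divided by the dummy hit rate, i.e. the prevalence of stable materials in the test set. A minimal sketch with hypothetical NumPy boolean arrays, not code from the benchmark repo:

```python
import numpy as np

# hypothetical labels: True = stable, False = unstable
dft_stable = np.array([True, False, False, True, False, False])  # DFT ground truth
ml_stable = np.array([True, True, False, True, False, False])  # model predictions

precision = dft_stable[ml_stable].mean()  # hit rate among model-flagged materials
prevalence = dft_stable.mean()  # dummy hit rate of random selection (~16.7% on WBM)

daf = precision / prevalence  # discovery acceleration factor
max_daf = 1 / prevalence  # perfect precision caps the DAF at ~6 on this dataset
```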
@@ -238,7 +254,7 @@ Our results demonstrate that this is not a given.
> This figure highlights how different models perform better or worse depending on the length of the discovery campaign.
> The UIP models (CHGNet, M3GNet, MACE) are seen to offer significantly improved precision on shorter campaigns as they are less prone to early false positive predictions.

-has models rank materials by model-predicted hull distance from most to least stable; materials furthest below the known hull at the top, materials right on the hull at the bottom.
+@Fig:cumulative-precision-recall has models rank materials by model-predicted hull distance from most to least stable; materials furthest below the known hull at the top, materials right on the hull at the bottom.
For each model, we iterate through that list and calculate at each step the precision and recall of correctly identified stable materials.
This simulates exactly how these models would be used in a prospective materials discovery campaign and reveals how a model's performance changes as a function of the discovery campaign length. As a practitioner, you have a certain amount of resources available to validate model predictions. These curves allow you to read off the best model given these conditions, based on the optimal trade-off between fewer false positives (precision) and fewer false negatives (recall) for the discovery task at hand.
In this case, it so happens that CHGNet achieves the highest precision _and_ recall at any number of screened materials.
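
A sketch of this screening simulation (hypothetical function and argument names, assuming NumPy arrays of model-predicted hull distances and boolean DFT stability labels):

```python
import numpy as np

def cumulative_precision_recall(e_hull_pred, dft_stable):
    """Precision and recall after validating the top-k ranked candidates, for every k."""
    order = np.argsort(e_hull_pred)  # most negative predicted hull distance first
    true_pos = np.cumsum(dft_stable[order])  # stable materials found after k screens
    k = np.arange(1, len(order) + 1)
    return true_pos / k, true_pos / dft_stable.sum()  # precision(k), recall(k)
```

Plotted against k, precision decays as false positives accumulate while recall climbs toward 1, which is the trade-off the figure lets practitioners read off.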
@@ -274,7 +290,7 @@ All force-free models exhibit a much worse case of early-on precision drop, fall
> If the model's error for a given prediction happens to point towards the stability threshold at 0 eV from the hull (the plot's center), its average error will change the stability classification of a material from true positive/negative to false negative/positive.
> The width of the 'rolling window' box indicates the width over which hull distance prediction errors were averaged.

-provides a visual representation of the reliability of different models based on the rolling mean absolute error (MAE) of model-predicted hull distances as a function of DFT distance to the Materials Project (MP) convex hull.
+@Fig:rolling-mae-vs-hull-dist-models provides a visual representation of the reliability of different models based on the rolling mean absolute error (MAE) of model-predicted hull distances as a function of DFT distance to the Materials Project (MP) convex hull.
The red-shaded area, referred to as the 'triangle of peril', emphasizes the zone where the average model error surpasses the distance to the stability threshold at 0 eV.
As long as the rolling MAE remains within this triangle, the model is most susceptible to misclassifying structures.
Because the average error is larger than the distance to the classification threshold at 0, it is large enough to flip a correct classification into an incorrect one (if the error happens to point toward the stability threshold).
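
A minimal sketch of the rolling MAE underlying this plot (hypothetical names; the window width and bin grid are illustrative assumptions, not the preprint's exact settings):

```python
import numpy as np

def rolling_mae(e_hull_dft, e_hull_pred, window=0.04, n_bins=200):
    """Rolling MAE of predicted vs DFT hull distance (eV/atom)."""
    abs_err = np.abs(e_hull_pred - e_hull_dft)
    centers = np.linspace(e_hull_dft.min(), e_hull_dft.max(), n_bins)
    # average the errors of all materials whose DFT hull distance lies within the
    # window around each bin center (bins containing no materials yield NaN)
    maes = [abs_err[np.abs(e_hull_dft - c) < window / 2].mean() for c in centers]
    return centers, np.array(maes)
```

A model stays out of the 'triangle of peril' wherever this curve drops below |e_hull_dft|, i.e. wherever the average error is smaller than the distance to the stability threshold.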
@@ -299,7 +315,7 @@ For @fig:rolling-mae-vs-hull-dist-models, this means models need to be much more
## Discussion

We have demonstrated the effectiveness of ML-based triage in HT materials discovery and posit that the benefits of including ML in discovery workflows now clearly outweigh the costs.
-shows in a realistic benchmark scenario that several models achieve a discovery acceleration greater than 2.5 across the whole dataset and up to 5 when considering only the 10k most stable predictions from each model (@fig:metrics-table-first-10k).
+@Fig:metrics-table shows in a realistic benchmark scenario that several models achieve a discovery acceleration greater than 2.5 across the whole dataset and up to 5 when considering only the 10k most stable predictions from each model (@fig:metrics-table-first-10k).
When starting this project, we were unsure which ML methodology would prove most promising for HT discovery.
Our findings demonstrate a clear superiority in accuracy and extrapolation performance of UIPs like CHGNet, M3GNet and MACE.
Modeling forces enables these models to chart a path through atomic configuration space closer to the DFT-relaxed structure from where a more informed final energy prediction is possible.
@@ -390,7 +406,7 @@ A material is classified as stable if the predicted $E_\text{above hull}$ lies b

> @label:fig:each-scatter-models Parity plot for each model's energy above hull predictions (based on their formation energy predictions) vs DFT ground truth, color-coded by log density of points.

-shows that all models do well for materials far below the convex hull (left side of the plot). Performance for materials far above the convex hull is more varied with occasional underpredictions of the energy of materials far above the convex hull (right side). All models suffer most in the mode of the distribution at $x = 0$.
+@Fig:each-scatter-models shows that all models do well for materials far below the convex hull (left side of the plot). Performance for materials far above the convex hull is more varied, with occasional underpredictions of their energies (right side). All models suffer most in the mode of the distribution at $x = 0$.

Two models stand out as anomalous to the general trends.
@@ -410,7 +426,7 @@ Since these derailed values are easily identified in practice when actually perf

> @label:fig:wrenformer-failures Symmetry analysis of the 941 Wrenformer failure cases in the shaded rectangle defined by $E_\text{DFT hull dist} < 1$ and $E_\text{ML hull dist} > 1$. Sunburst plot of spacegroups shows that close to 80% of severe energy overestimations are orthorhombic with spacegroup 71. The table on the right shows occurrence counts of exact structure prototypes for each material in the sunburst plot as well as their corresponding prevalence in the training set.

-shows 456 + 194 ($\sim$ 70%) of the failure cases in the shaded rectangle are two prototypes in spacegroup 71.
+@Fig:wrenformer-failures shows 456 + 194 ($\sim$ 70%) of the failure cases in the shaded rectangle are two prototypes in spacegroup 71.
The occurrence of those same prototypes in the MP training set shows almost no data support for the failing prototypes.
This suggests the reason Wrenformer fails so spectacularly on these structures is that it cannot deal with structure prototypes it has not seen at least several hundred examples of in its training data.
This points to stronger limitations on how much the discrete Wyckoff-based representation can extrapolate to new prototypes compared to the smooth local-environment-based inputs to GNN-type models.
@@ -439,7 +455,7 @@ Note the CGCNN+P histogram is more strongly peaked than CGCNN's which agrees bet

As a reminder, the WBM test set was generated in 5 successive batches, each step applying another element replacement to an MP source structure or a new stable crystal generated in one of the previous replacement rounds. The likelihood with which one element replaces another is governed by ICSD-mined chemical similarity scores for each pair of elements. Naively, one would expect model performance to degrade with increasing batch count, as repeated substitutions should on average 'diffuse' deeper into uncharted regions of material space, requiring the model to extrapolate more. We observe this effect for some models much more than others.

-shows the rolling MAE as a function of distance to the convex hull for each of the 5 WBM rounds of elemental substitution. These plots show a stronger performance decrease for Wrenformer and Voronoi RF than for UIPs like M3GNet, CHGNet, MACE and even force-less GNNs with larger errors like MEGNet and CGCNN.
+@Fig:rolling-mae-vs-hull-dist-wbm-batches-models shows the rolling MAE as a function of distance to the convex hull for each of the 5 WBM rounds of elemental substitution. These plots show a stronger performance decrease for Wrenformer and Voronoi RF than for UIPs like M3GNet, CHGNet, MACE and even force-less GNNs with larger errors like MEGNet and CGCNN.
0 commit comments