
Commit 551050e

make horizontal versions of roc-models-2x4.pdf and model-run-times-bar.pdf
add MACE and matlantis refs
1 parent 4b9da09 commit 551050e

10 files changed, +334 -53 lines changed


matbench_discovery/structure.py

+2 -1

@@ -10,7 +10,8 @@


 def perturb_structure(struct: Structure, gamma: float = 1.5) -> Structure:
-    """Perturb the atomic coordinates of a pymatgen structure.
+    """Perturb the atomic coordinates of a pymatgen structure. Used for CGCNN+P
+    training set augmentation.

     Args:
         struct (Structure): pymatgen structure to be perturbed
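A minimal sketch of the training-set augmentation use case the expanded docstring refers to (df_train and n_copies are hypothetical names, not part of this commit):

from pymatgen.core import Structure
from matbench_discovery.structure import perturb_structure

n_copies = 5  # hypothetical: number of perturbed copies to generate per training structure
augmented: list[Structure] = []
for struct in df_train.structure:  # df_train is assumed to hold pymatgen Structure objects
    augmented += [perturb_structure(struct) for _ in range(n_copies)]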

readme.md

+1 -1

@@ -17,7 +17,7 @@


 Matbench Discovery is an [interactive leaderboard](https://janosh.github.io/matbench-discovery/models) and associated [PyPI package](https://pypi.org/project/matbench-discovery) which together make it easy to rank ML energy models on a task designed to closely simulate a high-throughput discovery campaign for new stable inorganic crystals.

-So far, we've tested 8 models covering multiple methodologies ranging from random forests with structure fingerprints to graph neural networks, from one-shot predictors to iterative Bayesian optimizers and interatomic potential-based relaxers. We find [CHGNet](https://github.com/CederGroupHub/chgnet) ([paper](https://doi.org/10.48550/arXiv.2302.14231)) to achieve the highest F1 score of 0.59, $R^2$ of 0.61 and a discovery acceleration factor (DAF) of 3.06 (meaning a 3x higher rate of stable structures compared to dummy selection in our already enriched search space). We believe our results show that ML models have become robust enough to deploy them as triaging steps to more effectively allocate compute in high-throughput DFT relaxations. This work provides valuable insights for anyone looking to build large-scale materials databases.
+So far, we've tested 8 models covering multiple methodologies ranging from random forests with structure fingerprints to graph neural networks, from one-shot predictors to iterative Bayesian optimizers and interatomic potential relaxers. We find [CHGNet](https://github.com/CederGroupHub/chgnet) ([paper](https://doi.org/10.48550/arXiv.2302.14231)) to achieve the highest F1 score of 0.59, $R^2$ of 0.61 and a discovery acceleration factor (DAF) of 3.06 (meaning a 3x higher rate of stable structures compared to dummy selection in our already enriched search space). We believe our results show that ML models have become robust enough to deploy them as triaging steps to more effectively allocate compute in high-throughput DFT relaxations. This work provides valuable insights for anyone looking to build large-scale materials databases.

 <slot name="metrics-table" />

scripts/analyze_element_errors.py

+16 -16

@@ -31,6 +31,22 @@
 )


+# %% map average model error onto elements
+frac_comp_col = "fractional composition"
+df_wbm[frac_comp_col] = [
+    Composition(comp).fractional_composition for comp in tqdm(df_wbm.formula)
+]
+
+df_frac_comp = pd.DataFrame(comp.as_dict() for comp in df_wbm[frac_comp_col]).set_index(
+    df_wbm.index
+)
+assert all(
+    df_frac_comp.sum(axis=1).round(6) == 1
+), "composition fractions don't sum to 1"
+
+# df_frac_comp = df_frac_comp.dropna(axis=1, thresh=100)  # remove Xe with only 1 entry
+
+
 # %%
 df_mp = pd.read_csv(DATA_FILES.mp_energies, na_filter=False).set_index("material_id")
 # compute number of samples per element in training set
@@ -50,22 +66,6 @@
 fig.show()


-# %% map average model error onto elements
-frac_comp_col = "fractional composition"
-df_wbm[frac_comp_col] = [
-    Composition(comp).fractional_composition for comp in tqdm(df_wbm.formula)
-]
-
-df_frac_comp = pd.DataFrame(comp.as_dict() for comp in df_wbm[frac_comp_col]).set_index(
-    df_wbm.index
-)
-assert all(
-    df_frac_comp.sum(axis=1).round(6) == 1
-), "composition fractions don't sum to 1"
-
-# df_frac_comp = df_frac_comp.dropna(axis=1, thresh=100)  # remove Xe with only 1 entry
-
-
 # %%
 for label, srs in (
     ("MP", df_elem_err[train_count_col]),
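The relocated block only builds the fractional-composition matrix; a hedged sketch of how the "map average model error onto elements" step it introduces could be completed is a fraction-weighted mean error per element (the two energy column names are assumptions, not the script's exact identifiers):

# hypothetical column names for model-predicted vs DFT energy above the convex hull
abs_err = (df_wbm["each_pred"] - df_wbm["each_true"]).abs()
# weight each structure's error by its element fractions, then normalize per element
elem_err_sketch = df_frac_comp.mul(abs_err, axis="index").sum() / df_frac_comp.sum()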

scripts/calc_wandb_model_runtimes.py

+25 -8

@@ -11,6 +11,7 @@

 import pandas as pd
 import plotly.express as px
+import plotly.graph_objects as go
 import requests
 import wandb
 import wandb.apis.public
@@ -122,9 +123,6 @@
 print(f"{df_stats[time_col].sum()=:.0f} hours")

 # df_stats.round(2).to_json(f"{MODELS}/model-stats.json", orient="index")
-
-
-# %% plot model run times as pie chart
 df_time = (
     df_stats.sort_index()
     .filter(like=time_col)
@@ -134,6 +132,9 @@
     # .drop(index="BOWSR + MEGNet")
     .reset_index(names=(model_col := "Model"))
 )
+
+
+# %% plot model run times as pie chart
 fig = px.pie(
     df_time,
     values=time_col,
@@ -179,18 +180,34 @@

 # %% plot model run times as bar chart
 fig = df_melt.dropna().plot.bar(
-    y=time_col,
-    x=model_col,
+    x=time_col,
+    y=model_col,
     backend="plotly",
     # color=time_col,
     text_auto=".0f",
     text=time_col,
     color=model_col,
 )
-title = f"Total: {df_stats[time_col].sum():.0f} h"
+# reduce bar width
+fig.update_traces(width=0.7)
+
+title = f"All models: {df_stats[time_col].sum():.0f} h"
 fig.layout.legend.update(x=0.98, y=0.98, xanchor="right", yanchor="top", title=title)
 fig.layout.xaxis.title = ""
 fig.layout.margin.update(l=0, r=0, t=0, b=0)
-save_fig(fig, f"{FIGS}/model-run-times-bar.svelte")
-save_fig(fig, f"{PDF_FIGS}/model-run-times-bar.pdf")
+# save_fig(fig, f"{FIGS}/model-run-times-bar.svelte")
+
+pdf_fig = go.Figure(fig)
+# replace legend with annotation in PDF
+pdf_fig.layout.showlegend = False
+pdf_fig.add_annotation(
+    text=title,
+    font=dict(size=15),
+    x=0.99,
+    y=0.99,
+    showarrow=False,
+    xref="paper",
+    yref="paper",
+)
+save_fig(pdf_fig, f"{PDF_FIGS}/model-run-times-bar.pdf", height=300, width=800)
 fig.show()
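The bar-chart hunk bundles two separate tweaks: swapping x and y to make the bars horizontal, and replacing the legend with a paper-coordinate annotation in the PDF copy. A self-contained sketch of that pattern on made-up data (not the wandb runtimes the script actually queries):

import pandas as pd
import plotly.express as px

df_demo = pd.DataFrame({"Model": ["A", "B", "C"], "hours": [12, 34, 56]})  # toy data
fig_demo = px.bar(df_demo, x="hours", y="Model", orientation="h", text_auto=".0f")
fig_demo.update_traces(width=0.7)  # narrower bars, as in the diff
fig_demo.layout.showlegend = False  # drop the legend...
fig_demo.add_annotation(  # ...and pin a summary text box to the plot area instead
    text=f"All models: {df_demo.hours.sum():.0f} h",
    x=0.99, y=0.99, xref="paper", yref="paper", showarrow=False,
)
fig_demo.show()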

scripts/prc_roc_curves_models.py

+19 -14

@@ -5,6 +5,8 @@


 # %%
+import math
+
 import pandas as pd
 from pymatviz.utils import save_fig
 from sklearn.metrics import auc, precision_recall_curve, roc_curve
@@ -23,6 +25,9 @@
 facet_col = "Model"
 color_col = "Stability Threshold"

+n_cols = 4
+n_rows = math.ceil(len(models) // n_cols)
+

 # %%
 df_roc = pd.DataFrame()
@@ -50,7 +55,7 @@
         x="FPR",
         y="TPR",
         facet_col=facet_col,
-        facet_col_wrap=2,
+        facet_col_wrap=4,
         backend="plotly",
         height=150 * len(df_roc[facet_col].unique()),
         color=color_col,
@@ -59,33 +64,32 @@
         range_color=(-0.5, 0.5),
         hover_name=facet_col,
         hover_data={facet_col: False},
+        facet_col_spacing=0.03,
+        facet_row_spacing=0.1,
     )
 )

 for anno in fig.layout.annotations:
     anno.text = anno.text.split("=", 1)[1]  # remove Model= from subplot titles

-fig.layout.coloraxis.colorbar.update(
-    x=1,
-    y=1,
-    xanchor="right",
-    yanchor="top",
-    thickness=14,
-    lenmode="pixels",
-    len=210,
-    title_side="right",
-)
+fig.layout.coloraxis.colorbar.update(thickness=14, title_side="right")
+if n_cols == 2:
+    fig.layout.coloraxis.colorbar.update(
+        x=1, y=1, xanchor="right", yanchor="top", lenmode="pixels", len=210
+    )
+
 fig.add_shape(type="line", x0=0, y0=0, x1=1, y1=1, line=line, row="all", col="all")
 fig.add_annotation(text="No skill", x=0.5, y=0.5, showarrow=False, yshift=-10)
 # allow scrolling and zooming each subplot individually
 fig.update_xaxes(matches=None)
+fig.layout.margin.update(l=0, r=0, b=0, t=20, pad=0)
 fig.update_yaxes(matches=None)
 fig.show()


 # %%
-save_fig(fig, f"{FIGS}/roc-models.svelte")
-save_fig(fig, f"{PDF_FIGS}/roc-models.pdf")
+# save_fig(fig, f"{FIGS}/roc-models-{n_rows}x{n_cols}.svelte")
+save_fig(fig, f"{PDF_FIGS}/roc-models-{n_rows}x{n_cols}.pdf", width=1000, height=400)


 # %%
@@ -142,6 +146,7 @@


 # %%
-save_fig(fig, f"{FIGS}/prc-models.svelte")
+save_fig(fig, f"{FIGS}/prc-models-{n_rows}x{n_cols}.svelte")
+save_fig(fig, f"{PDF_FIGS}/prc-models-{n_rows}x{n_cols}.pdf")
 fig.update_yaxes(matches=None)
 fig.show()
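For context, a rough sketch of how per-model ROC points could be assembled into the long-format frame behind the faceted figure above (each_true and df_preds are hypothetical stand-ins for the script's own columns; note that with 8 models and 4 columns, integer and true division inside math.ceil both yield the 2x4 grid in the renamed output files):

import math
import pandas as pd
from sklearn.metrics import roc_curve

n_cols = 4
n_rows = math.ceil(len(models) / n_cols)  # 8 models -> 2 rows of 4 facets

df_roc_sketch = pd.DataFrame()
for model in models:
    # each_true: binary DFT stability labels; df_preds[model]: the model's continuous stability score
    fpr, tpr, thresholds = roc_curve(each_true, df_preds[model])
    df_model = pd.DataFrame(
        {"FPR": fpr, "TPR": tpr, "Stability Threshold": thresholds, "Model": model}
    )
    df_roc_sketch = pd.concat([df_roc_sketch, df_model])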

scripts/scatter_e_above_hull_models.py

+4 -1

@@ -5,6 +5,8 @@


 # %%
+import math
+
 import numpy as np
 import plotly.express as px
 from pymatviz.utils import add_identity_line, bin_df_cols, save_fig
@@ -118,6 +120,8 @@

 # %% plot all models in separate subplots
 domain = (-4, 7)
+n_cols = 4
+n_rows = math.ceil(len(models) / n_cols)

 fig = px.scatter(
     df_bin,
@@ -224,7 +228,6 @@


 # %%
-n_rows, n_cols, *_ = np.array(fig._validate_get_grid_ref(), object).shape
 fig_name = f"each-scatter-models-{n_rows}x{n_cols}"
 save_fig(fig, f"{FIGS}/{fig_name}.svelte")
 save_fig(fig, f"{PDF_FIGS}/{fig_name}.pdf")
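The last hunk replaces a read-back of plotly's private figure internals with the grid shape computed up front. A short sketch of the same idea in isolation, assuming models is the script's list of model names and the column names are placeholders:

import math
import plotly.express as px

n_cols = 4
n_rows = math.ceil(len(models) / n_cols)  # 8 models -> 2 rows of 4 facets

# the same n_cols drives the facet layout, so the {n_rows}x{n_cols} file name always matches the figure
fig = px.scatter(df_bin, x="each_true", y="each_pred", facet_col="Model", facet_col_wrap=n_cols)
fig_name = f"each-scatter-models-{n_rows}x{n_cols}"  # -> "each-scatter-models-2x4"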

site/src/routes/preprint/+page.md

+4 -4

@@ -67,7 +67,7 @@ However, using the DFT-relaxed structure as input to CGCNN renders the discovery
 As the name suggests, this work seeks to expand upon the original Matbench suite of property prediction tasks @dunn_benchmarking_2020. By providing a standardized collection of datasets along with canonical cross-validation splits for model evaluation, Matbench helped focus the field of ML for materials, increase comparability across papers and provide a quantitative measure of progress in the field. It aimed to catalyze the field of ML for materials through competition and establishing common goal posts in a similar fashion as ImageNet did for computer vision.

 Matbench released a test suite of 13 supervised tasks for different material properties ranging from thermal (formation energy, phonon frequency peak), electronic (band gap), optical (refractive index) to tensile and elastic (bulk and shear moduli).
-They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources. 4 tasks are composition-only while 9 provide the relaxed crystal structure as input.
+They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources.
 Importantly, all tasks were exclusively concerned with the properties of known materials.
 We believe a task that simulates a materials discovery campaign by requiring materials stability prediction from unrelaxed structures to be a missing piece here.

@@ -107,7 +107,7 @@ To simulate a real discovery campaign, our test set inputs are unrelaxed structu

 ## Models

-Our initial benchmark release includes 8 models. @Fig:metrics-table includes all models but we focus on the 6 best performers in subsequent figures for visual clarity.
+Our initial benchmark release includes 8 models.

 1. **Voronoi+RF** @ward_including_2017 - A random forest trained to map a combination of composition-based Magpie features and structure-based relaxation-invariant Voronoi tessellation features (effective coordination numbers, structural heterogeneity, local environment properties, ...) to DFT formation energies.

@@ -163,7 +163,7 @@ Our initial benchmark release includes 8 models. @Fig:metrics-table includes all
 >
 > </details>

-@Fig:metrics-table shows performance metrics for all models considered in v1 of our benchmark.
+@Fig:metrics-table shows performance metrics for all models included in the initial release of Matbench Discovery.
 CHGNet takes the top spot on all metrics except true positive rate (TPR) and emerges as current SOTA for ML-guided materials discovery. The discovery acceleration factor (DAF) measures how many more stable structures a model found compared to the dummy discovery rate of 43k / 257k $\approx$ 16.7\% achieved by randomly selecting test set crystals. Consequently, the maximum possible DAF is ~6. This highlights the fact that our benchmark is made more challenging by deploying models on an already enriched space with a much higher fraction of stable structures than uncharted materials space at large. As the convex hull becomes more thoroughly sampled by future discovery, the fraction of unknown stable structures decreases, naturally leading to less enriched future test sets which will allow for higher maximum DAFs.

 Note that MEGNet outperforms M3GNet on DAF (2.70 vs 2.66) even though M3GNet is superior to MEGNet in all other metrics. The reason is the one outlined in the previous paragraph as becomes clear from @fig:cumulative-clf-metrics. MEGNet's line ends at 55.6 k materials which is closest to the true number of 43 k stable materials in our test set. All other models overpredict the sum total of stable materials by anywhere from 40% (~59 k for CGCNN) to 104% (85 k for Wrenformer), resulting in large numbers of false positive predictions which lower their DAFs.
@@ -180,7 +180,7 @@ The reason CGCNN+P achieves better regression metrics than CGCNN but is still wo
 <CumulativeClfMetrics style="margin: 0 -2em 0 -4em;" />
 {/if}

-> @label:fig:cumulative-clf-metrics Cumulative precision and recall over the course of a simulated discovery campaign. This figure highlights how different models perform better or worse depending on the length of the discovery campaign. Length here is an integer measuring how many DFT relaxations you have compute budget for.
+> @label:fig:cumulative-clf-metrics Cumulative precision and recall over the course of a simulated discovery campaign. This figure highlights how different models perform better or worse depending on the length of the discovery campaign. Length here is an integer measuring how many DFT relaxations you have compute budget for. We only show the 6 best performing models for visual clarity.

 @Fig:cumulative-clf-metrics simulates ranking materials from most to least stable according to model-predicted energies. For each model, we go down that list material by material, calculating at each step the precision and recall of correctly identified stable materials. This simulates exactly how these models might be used in a prospective materials discovery campaign and reveal how a model's performance changes as a function of the discovery campaign length, i.e. the amount of resources available to validate model predictions.

site/src/routes/preprint/iclr-ml4mat/+page.md

+5 -5

@@ -10,7 +10,7 @@

 <summary>

-We present a new machine learning (ML) benchmark for materials stability predictions named `Matbench Discovery`. A goal of this benchmark is to highlight the need to focus on metrics that directly measure their utility in prospective discovery campaigns as opposed to analyzing models based on predictive accuracy alone. Our benchmark consists of a task designed to closely simulate the deployment of ML energy models in a high-throughput search for stable inorganic crystals. We explore a wide variety of models covering multiple methodologies ranging from random forests to GNNs, and from one-shot predictors to iterative Bayesian optimizers and interatomic potential-based relaxers. We find M3GNet to achieve the highest F1 score of 0.58 and $R^2$ of 0.59 while MEGNet wins on discovery acceleration factor (DAF) with 2.94. Our results provide valuable insights for maintainers of high throughput materials databases to start using these models as triaging steps to more effectively allocate compute for DFT relaxations.
+We present a new machine learning (ML) benchmark for materials stability predictions named `Matbench Discovery`. A goal of this benchmark is to highlight the need to focus on metrics that directly measure their utility in prospective discovery campaigns as opposed to analyzing models based on predictive accuracy alone. Our benchmark consists of a task designed to closely simulate the deployment of ML energy models in a high-throughput search for stable inorganic crystals. We explore a wide variety of models covering multiple methodologies ranging from random forests to GNNs, and from one-shot predictors to iterative Bayesian optimizers and interatomic potential relaxers. We find M3GNet to achieve the highest F1 score of 0.58 and $R^2$ of 0.59 while MEGNet wins on discovery acceleration factor (DAF) with 2.94. Our results provide valuable insights for maintainers of high throughput materials databases to start using these models as triaging steps to more effectively allocate compute for DFT relaxations.

 </summary>

@@ -51,7 +51,7 @@ As the name suggests, this work seeks to expand upon the original Matbench suite
 and attempt to accelerate the field similar to what ImageNet did for computer vision.

 Matbench released a test suite of 13 supervised tasks for different material properties ranging from thermal (formation energy, phonon frequency peak), electronic (band gap), optical (refractive index) to tensile and elastic (bulk and shear moduli).
-They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources. 4 tasks are composition-only while 9 provide the relaxed crystal structure as input.
+They range in size from ~300 to ~132,000 samples and include both DFT and experimental data sources.
 Importantly, all tasks were exclusively concerned with the properties of known materials.
 We believe a task that simulates a materials discovery campaign by requiring materials stability predictions from unrelaxed structures to be a missing piece here.

@@ -86,7 +86,7 @@ Moreover, to simulate a discovery campaign our test set inputs are unrelaxed str

 ## Models

-Our initial benchmark release includes 8 models. @Fig:metrics-table includes all models but we focus on the 6 best performers in subsequent figures for visual clarity.
+Our initial benchmark release includes 8 models.

 1. **Voronoi+RF** @ward_including_2017 - A random forest trained to map a combination of composition-based Magpie features and structure-based relaxation-invariant Voronoi tessellation features (effective coordination numbers, structural heterogeneity, local environment properties, ...) to DFT formation energies.

@@ -108,14 +108,14 @@ Our initial benchmark release includes 8 models. @Fig:metrics-table includes all

 > @label:fig:metrics-table Regression and classification metrics for all models tested on our benchmark. The heat map ranges from yellow (best) to blue (worst) performance. DAF = discovery acceleration factor (see text), TPR = true positive rate, TNR = false negative rate, MAE = mean absolute error, RMSE = root mean squared error

-@Fig:metrics-table shows performance metrics for all models considered in v1 of our benchmark.
+@Fig:metrics-table shows performance metrics for all models included in the initial release of Matbench Discovery.
 M3GNet takes the top spot on most metrics and emerges as current SOTA for ML-guided materials discovery. The discovery acceleration factor (DAF) measures how many more stable structures a model found among the ones it predicted stable compared to the dummy discovery rate of 43k / 257k $\approx$ 16.7% achieved by randomly selecting test set crystals. Consequently, the maximum possible DAF is ~6. This highlights the fact that our benchmark is made more challenging by deploying models on an already enriched space with a much higher fraction of stable structures over randomly exploring materials space. As the convex hull becomes more thoroughly sampled by future discovery, the fraction of unknown stable structures decreases, naturally leading to less enriched future test sets which will allow for higher maximum DAFs. The reason MEGNet outperforms M3GNet on DAF becomes clear from @fig:cumulative-clf-metrics by noting that MEGNet's line ends closest to the total number of stable materials. The other models overpredict this number, resulting in large numbers of false positive predictions that drag down their DAFs.

 {#if browser}
 <RollingMaeVsHullDistModels />
 {/if}

-> @label:fig:rolling-mae-vs-hull-dist-models Rolling MAE on the WBM test set as the energy to the convex hull of the MP training set is varied. The white box in the bottom left indicates the size of the rolling window. The highlighted 'triangle of peril' shows where the models are most likely to misclassify structures.
+> @label:fig:rolling-mae-vs-hull-dist-models Rolling MAE on the WBM test set as the energy to the convex hull of the MP training set is varied. The white box in the bottom left indicates the size of the rolling window. The highlighted 'triangle of peril' shows where the models are most likely to misclassify structures. We only show the 6 best performing models for visual clarity. We only show the 6 best performing models for visual clarity.

 {#if browser}
 <CumulativeClfMetrics style="margin: 0 -2em 0 -4em;" />
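As a sanity check on the DAF numbers quoted in both preprint pages above, the baseline and ceiling follow directly from the test-set composition (a back-of-the-envelope sketch; variable names are purely illustrative):

n_stable, n_total = 43_000, 257_000  # stable vs total structures in the WBM test set, per the text
dummy_hit_rate = n_stable / n_total  # ~0.167: precision of randomly selecting test set crystals
max_daf = 1 / dummy_hit_rate  # ~6: a perfect model's precision of 1 divided by the dummy rate
print(f"dummy rate {dummy_hit_rate:.1%}, max DAF ~{max_daf:.1f}")  # -> dummy rate 16.7%, max DAF ~6.0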
