
Commit 4121f49

add citation.cff, add step-by-step guide for adding new models to the leaderboard to site/src/routes/how-to-contribute/+page.md
motivate choice of Voronoi RF as baseline model in models/voronoi/readme.md
1 parent 2908fdf commit 4121f49

3 files changed: +165 −8 lines


citation.cff

+38
@@ -0,0 +1,38 @@

cff-version: 1.2.0
title: Matbench Discovery
message: If you use this software, please cite it as below.
authors:
  - family-names: Riebesell
    given-names: Janosh
    affiliation: University of Cambridge, Lawrence Berkeley National Laboratory
    orcid: https://orcid.org/0000-0001-5233-3462
    corresponding: true
  - family-names: Goodall
    given-names: Rhys
    affiliation: University of Cambridge
    orcid: https://orcid.org/0000-0002-6589-1700
  - family-names: Jain
    given-names: Anubhav
    orcid: https://orcid.org/0000-0001-5893-9967
    affiliation: Lawrence Berkeley National Laboratory
  - family-names: Persson
    given-names: Kristin
    orcid: https://orcid.org/0000-0003-2495-5509
    affiliation: Lawrence Berkeley National Laboratory
  - family-names: King-Smith
    given-names: Emma
    orcid: https://orcid.org/0000-0002-2999-0955
    affiliation: University of Cambridge
  - family-names: Lee
    given-names: Alpha
    orcid: https://orcid.org/0000-0002-9616-3108
    affiliation: University of Cambridge
license: MIT
license-url: https://github.com/janosh/matbench-discovery/blob/main/license
repository-code: https://github.com/janosh/matbench-discovery
type: software
url: https://github.com/janosh/matbench-discovery
doi: TODO
version: 1.0.0 # replace with whatever Matbench Discovery version you use
date-released: TODO

models/voronoi/readme.md

+13 −3
@@ -1,17 +1,27 @@

# Voronoi Random Forest

## Model Architecture

Voronoi tessellation with `matminer` featurization piped into `scikit-learn` [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor).

## Reasoning

The idea behind this combination of features and model was to have an easy-to-implement baseline algorithm. It's somewhat dated in that it uses handcrafted Magpie features (which have been shown to underperform learned features on datasets exceeding ~10^4 samples), but not so weak as to be indefensible. The fact that its Voronoi-tessellation-based featurization is invariant to crystal structure relaxation makes it a natural choice for predicting the stability of unrelaxed crystals.
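
A minimal sketch of what such a featurize-then-fit pipeline could look like is shown below. The featurizer presets, column names and target column are illustrative assumptions, not the exact training script used for the leaderboard:

```py
# hypothetical sketch of the featurize-then-fit pipeline described above;
# featurizer presets, column names and target column are assumptions
import pandas as pd
from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.structure import SiteStatsFingerprint
from pymatgen.core import Lattice, Structure
from sklearn.ensemble import RandomForestRegressor

# toy input frame with pymatgen structures/compositions and a target energy
struct = Structure(Lattice.cubic(3), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
df = pd.DataFrame(
    {"structure": [struct], "composition": [struct.composition], "e_form_per_atom": [0.0]}
)

magpie = ElementProperty.from_preset("magpie")  # handcrafted Magpie composition features
voronoi = SiteStatsFingerprint.from_preset("CoordinationNumber_ward-prb-2017")  # Voronoi-based site stats
for featurizer in (magpie, voronoi):
    featurizer.set_n_jobs(1)  # single process avoids the OOM issues described below

df = magpie.featurize_dataframe(df, col_id="composition", ignore_errors=True)
df = voronoi.featurize_dataframe(df, col_id="structure", ignore_errors=True)

feature_cols = magpie.feature_labels() + voronoi.feature_labels()
model = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)
model.fit(df[feature_cols], df["e_form_per_atom"])
```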

## OOM errors during featurization

An unexpected obstacle made this model more difficult to train and test than anticipated: `matminer` uses `multiprocessing`, which seems to cause out-of-memory errors on large structures. Initially, we couldn't get [`MultipleFeaturizer`] to run without crashing, even on small subsets of the data (1%) and with the `sbatch` flag `--mem 100G`:

```log
MultipleFeaturizer: 28%|██▊ | 724/2575 [01:08<04:15, 7.25it/s]/var/spool/slurm/slurmd/job7401930/slurm_script: line 4: 2625851 Killed python
slurmstepd: error: Detected 52 oom-kill event(s) in StepId=7401930.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
4:00
```

A saving tip came from [Alex Dunn via Slack](https://berkeleytheory.slack.com/archives/D03ULSTNRMX/p1668746161675349): try `featurizer.set_n_jobs(1)`.

## Archive

Files in `2022-10-04-rhys-voronoi.zip` were received from Rhys via [Slack](https://ml-physics.slack.com/archives/DD8GBBRLN/p1664929946687049) and are unchanged originals.

[`multiplefeaturizer`]: https://hackingmaterials.lbl.gov/matminer/matminer.featurizers.html#matminer.featurizers.base.MultipleFeaturizer

site/src/routes/how-to-contribute/+page.md

+114 −5
@@ -1,6 +1,6 @@

## Installation

The recommended way to acquire the train and test data for this benchmark is through its Python package [available on PyPI](https://pypi.org/project/matbench-discovery):

```zsh
pip install matbench-discovery
```

@@ -14,16 +14,45 @@ Here's an example script of how to download the training and test set files for

```py notest
from matbench_discovery.data import load_train_test
from matbench_discovery.data import df_wbm, DATA_FILENAMES

# any subset of these keys can be passed to load_train_test()
assert sorted(DATA_FILENAMES) == [
    "mp-computed-structure-entries",
    "mp-elemental-ref-energies",
    "mp-energies",
    "mp-patched-phase-diagram",
    "wbm-computed-structure-entries",
    "wbm-initial-structures",
    "wbm-summary",
]

df_wbm = load_train_test("wbm-summary", version="v1.0.0")

assert df_wbm.shape == (256963, 17)

assert list(df_wbm) == [
    "formula",
    "n_sites",
    "volume",
    "uncorrected_energy",
    "e_form_per_atom_wbm",
    "e_hull_wbm",
    "bandgap_pbe",
    "uncorrected_energy_from_cse",
    "e_correction_per_atom_legacy",
    "e_correction_per_atom_mp2020",
    "e_above_hull_uncorrected_ppd_mp",
    "e_above_hull_mp2020_corrected_ppd_mp",
    "e_above_hull_legacy_corrected_ppd_mp",
    "e_form_per_atom_uncorrected",
    "e_form_per_atom_mp2020_corrected",
    "e_form_per_atom_legacy_corrected",
    "wyckoff_spglib",
]
```

`"wbm-summary"` column glossary:

1. `formula`: A compound's unreduced alphabetical formula
1. `n_sites`: Number of sites in the structure's unit cell

@@ -39,7 +68,7 @@ Column glossary

## Direct Download

You can also download the data files directly from GitHub (a minimal loading sketch follows this list):

1. [`2022-10-19-wbm-summary.csv`](https://github.com/janosh/matbench-discovery/raw/v1.0.0/data/wbm/2022-10-19-wbm-summary.csv) [[GitHub](https://github.com/janosh/matbench-discovery/blob/v1/data/wbm/2022-10-19-wbm-summary.csv)]: Computed material properties only, no structures. Available properties are VASP energy, formation energy, energy above the convex hull, volume, band gap, number of sites per unit cell, and more. `e_form_per_atom` and `e_above_hull` each have 3 separate columns for old, new and no Materials Project energy corrections.
1. [`2022-10-19-wbm-init-structs.json`](https://github.com/janosh/matbench-discovery/raw/v1.0.0/data/wbm/2022-10-19-wbm-init-structs.json) [[GitHub](https://github.com/janosh/matbench-discovery/blob/v1/data/wbm/2022-10-19-wbm-init-structs.json)]: Unrelaxed WBM structures

@@ -50,3 +79,83 @@ You can also download the data files directly:
1. [`2022-09-19-mp-elemental-ref-energies.json`](https://github.com/janosh/matbench-discovery/raw/v1.0.0/data/wbm/2022-09-19-mp-elemental-ref-energies.json) [[GitHub](https://github.com/janosh/matbench-discovery/blob/v1/data/wbm/2022-09-19-mp-elemental-ref-energies.json)]: Minimum energy PDEntries for each element present in the Materials Project

[wbm paper]: https://nature.com/articles/s41524-020-00481-6
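
A minimal sketch of such a direct download, assuming `pandas` is installed (the raw URL is copied from the first list item above):

```py
# minimal sketch: read the WBM summary CSV straight from its raw GitHub URL
import pandas as pd

url = "https://github.com/janosh/matbench-discovery/raw/v1.0.0/data/wbm/2022-10-19-wbm-summary.csv"
df_wbm = pd.read_csv(url)  # one row per WBM compound (~257k rows)

print(df_wbm.shape)
```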

## How to submit a new model to the leaderboard

To add a new model to this benchmark, please create a pull request to the `main` branch of <https://github.com/janosh/matbench-discovery> that includes at least these 3 required files:

1. `<yyyy-mm-dd>-<model-name>-preds.(json|csv).gz`: Your model's energy predictions for all ~250k WBM compounds as compressed JSON or CSV. The recommended way to create this file is with `pandas.DataFrame.to_{json|csv}('<yyyy-mm-dd>-<model-name>-preds.(json|csv).gz')` (see the sketch after the metadata template below). JSON is preferred over CSV if your model predicts not only energies (floats) but also Python objects such as pseudo-relaxed structures (see the M3GNet and BOWSR test scripts).
1. `test_<model-name>.(py|ipynb)`: The Python script or Jupyter notebook used to generate the energy predictions. Ideally, this file should have comments explaining at a high level what the code is doing and how the model works, so others can understand and reproduce your results. If the model deployed on this benchmark was trained specifically for this purpose (i.e. if you wrote any training/fine-tuning code while preparing your PR), please also include it as `train_<model-name>.(py|ipynb)`.
1. `metadata.yml`: A file recording all relevant metadata for your algorithm, such as model name, authors (can differ between the model and the PR), package requirements, relevant citations/links to publications and other info about the model. Here's a template:

```yml
model_name: My cool foundational model v1
authors:
  - family-names: Doe
    given-names: John
    affiliation: Some University, Some National Lab
    orcid: https://orcid.org/0000-xxxx-yyyy-zzzz
    corresponding: true
    role: Model & PR
  - family-names: Doe
    given-names: Jane
    affiliation: Some National Lab
    orcid: https://orcid.org/0000-xxxx-yyyy-zzzz
    role: Model
repo: https://github.com/<user>/<repo>
url: https://<model-docs-or-similar>.org
doi: https://doi.org/10.5281/zenodo.0000000
version: 1.0.0
requirements:
  torch: 1.13.0
  torch-geometric: 2.0.9
  ...
notes:
  Optional free form multi-line notes that might help others reproduce your results.
```

Only the keys `model_name`, `authors`, `repo` and `version` are required. Arbitrary other keys can be added as needed.
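
For the predictions file referenced in the list above, a hypothetical sketch of writing it with `pandas` might look like this (the date, model name, column label and material IDs are placeholders, not a prescribed schema):

```py
# hypothetical sketch of saving model predictions as a compressed CSV;
# the file name, column label and material IDs below are placeholders
import pandas as pd

df_preds = pd.DataFrame(
    {"e_form_per_atom_pred": [0.12, -0.45]},
    index=pd.Index(["wbm-1-1", "wbm-1-2"], name="material_id"),
)

# pandas infers gzip compression from the .gz suffix
df_preds.to_csv("2023-01-01-my-model-preds.csv.gz")
```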

Please see any of the subdirectories in [`models/`](https://github.com/janosh/matbench-discovery/tree/main/models) for example submissions. More detailed step-by-step instructions follow below:

### Step 1: Clone the repo

```sh
git clone https://github.com/janosh/matbench-discovery
```

### Step 2: Commit model preds, script and metadata

Create a new folder

```sh
mkdir models/<model-name>
```

and place the above listed files there. The file structure should look like this:

```txt
matbench-discovery-root
└── models
    └── <model name>
        ├── metadata.yml
        ├── <yyyy-mm-dd>-<model-name>-preds.(json|csv).gz
        ├── test_<model-name>.py
        ├── readme.md # optional
        └── train_<model-name>.py # optional
```

You can include arbitrary other supporting files like metadata, model features (below 10MB to keep `git clone` time low) if they are needed to run the model or might help others reproduce your results. For larger files, please upload to Figshare or similar and link them somewhere in your files.

### Step 3: Create a PR to the [Matbench Discovery repo](https://github.com/janosh/matbench-discovery)

Commit your files to the repo on a branch called `<model-name>` and create a pull request (PR) to the Matbench Discovery repository.

```sh
git checkout -b <model-name>  # create the branch for your submission
git add models/<model-name>
git commit -m 'add <model-name> to Matbench Discovery leaderboard'
```

And you're done! Once tests pass and the PR is merged, your model will be added to the leaderboard! 🎉
