# Voronoi Random Forest
## Model Architecture
Voronoi tessellation with `matminer` featurization piped into `scikit-learn`'s [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).
## Reasoning
The idea behind this combination of features and model was to have an easy-to-implement baseline algorithm. It's a bit dated in that it uses handcrafted Magpie features (which have been shown to underperform learned features on datasets exceeding ~10^4 samples), but not so weak as to be indefensible. Since its Voronoi-tessellation-based featurization is invariant to crystal structure relaxation, it is a natural choice for predicting the stability of unrelaxed crystals.
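The model half of this pipeline is just an off-the-shelf random forest. A minimal sketch, assuming the Voronoi/Magpie features have already been computed by `matminer` into a plain feature matrix (the random data below is a stand-in for that featurized output, not real structures):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in for a matminer feature matrix: rows = structures, columns = features.
# In the real pipeline these would come from the Voronoi/Magpie featurizers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))  # hypothetical featurized structures
y = rng.normal(size=200)  # hypothetical target energies (eV/atom)

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X, y)
preds = model.predict(X)  # one energy prediction per structure
```

Hyperparameter values here are illustrative defaults, not the tuned settings used for the benchmark run.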
## OOM errors during featurization
There was an obstacle that made this model more difficult to train and test than anticipated: `matminer` uses `multiprocessing`, which seems to be the cause of out-of-memory errors on large structures. Initially, `MultipleFeaturizer` wouldn't run without crashing, even when running on small subsets of the data (1%) and setting the `sbatch` flag `--mem 100G`:
```
slurmstepd: error: Detected 52 oom-kill event(s) in StepId=7401930.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
Saving tip came from [Alex Dunn via Slack](https://berkeleytheory.slack.com/archives/D03ULSTNRMX/p1668746161675349): try `featurizer.set_n_jobs(1)`.
## Archive
Files in `2022-10-04-rhys-voronoi.zip` received from Rhys via [Slack](https://ml-physics.slack.com/archives/DD8GBBRLN/p1664929946687049). They are unchanged originals.
File: `site/src/routes/how-to-contribute/+page.md`
## Installation
The recommended way to acquire the train and test data for this benchmark is through its Python package [available on PyPI](https://pypi.org/project/matbench-discovery):
```zsh
pip install matbench-discovery
```

Here's an example script of how to download the training and test set files:
```py notest
from matbench_discovery.data import load_train_test
from matbench_discovery.data import df_wbm, DATA_FILENAMES
# any subset of these keys can be passed to load_train_test()
```
1. `formula`: A compound's unreduced alphabetical formula
1. `n_sites`: Number of sites in the structure's unit cell
## Direct Download
You can also download the data files directly from GitHub:
1. [`2022-10-19-wbm-summary.csv`](https://github.com/janosh/matbench-discovery/raw/v1.0.0/data/wbm/2022-10-19-wbm-summary.csv) [[GitHub](https://github.com/janosh/matbench-discovery/blob/v1/data/wbm/2022-10-19-wbm-summary.csv)]: Computed material properties only, no structures. Available properties are VASP energy, formation energy, energy above the convex hull, volume, band gap, number of sites per unit cell, and more. `e_form_per_atom` and `e_above_hull` each have 3 separate columns for old, new and no Materials Project energy corrections.
1. [`2022-09-19-mp-elemental-ref-energies.json`](https://github.com/janosh/matbench-discovery/raw/v1.0.0/data/wbm/2022-09-19-mp-elemental-ref-energies.json) [[GitHub](https://github.com/janosh/matbench-discovery/blob/v1/data/wbm/2022-09-19-mp-elemental-ref-energies.json)]: Minimum energy `PDEntries` for each element present in the Materials Project
To add a new model to this benchmark, please create a pull request to the `main` branch of <https://github.com/janosh/matbench-discovery> that includes at least these 3 required files:
1. `<yyyy-mm-dd>-<model-name>-preds.(json|csv).gz`: Your model's energy predictions for all ~250k WBM compounds as compressed JSON or CSV. The recommended way to create this file is with `pandas.DataFrame.to_{json|csv}('<yyyy-mm-dd>-<model-name>-preds.(json|csv).gz')`. JSON is preferred over CSV if your model predicts not only energies (floats) but also Python objects, e.g. pseudo-relaxed structures (see the M3GNet and BOWSR test scripts).
1. `test_<model-name>.(py|ipynb)`: The Python script or Jupyter notebook used to generate the energy predictions. Ideally, this file should have comments explaining at a high level what the code is doing and how the model works so others can understand and reproduce your results. If the model deployed on this benchmark was trained specifically for this purpose (i.e. if you wrote any training/fine-tuning code while preparing your PR), please also include it as `train_<model-name>.(py|ipynb)`.
1. `metadata.yml`: A file to record all relevant metadata about your algorithm, like model name, authors (can differ between the model and the PR), package requirements, relevant citations/links to publications and other info about the model. Here's a template:

   ```yml
   model_name: My Model # required
   authors: # required
     - name: Jane Doe
   repo: https://github.com/<user>/<repo> # required
   version: 1.0.0 # required
   notes: |
     Optional free form multi-line notes that might help others reproduce your results.
   ```
Only the keys `model_name`, `authors`, `repo`, `version` are required. Arbitrary other keys can be added as needed.
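Since only those four keys are required, a quick sanity check before opening a PR can be as simple as the following. This is a hypothetical helper, not part of the `matbench-discovery` package; the example mapping stands in for the output of a YAML parser like `yaml.safe_load`:

```python
# Hypothetical pre-submission check: verify a parsed metadata.yml mapping
# defines the four required keys.
REQUIRED_KEYS = {"model_name", "authors", "repo", "version"}

def missing_metadata_keys(metadata: dict) -> list[str]:
    """Return required keys absent from a parsed metadata.yml mapping."""
    return sorted(REQUIRED_KEYS - metadata.keys())

# Example: a minimal valid metadata mapping (illustrative values)
meta = {
    "model_name": "Voronoi Random Forest",
    "authors": [{"name": "Jane Doe"}],
    "repo": "https://github.com/janosh/matbench-discovery",
    "version": "1.0.0",
}
missing = missing_metadata_keys(meta)  # → [] for a valid mapping
```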
Please see any of the subdirectories in [`models/`](https://github.com/janosh/matbench-discovery/tree/main/models) for example submissions. More detailed step-by-step instructions below:
### Step 2: Commit model preds, script and metadata
Create a new folder
```sh
mkdir models/<model-name>
```
and place the above listed files there. The file structure should look like this:
```txt
matbench-discovery-root
└── models
    └── <model name>
        ├── metadata.yml
        ├── <yyyy-mm-dd>-<model-name>-preds.(json|csv).gz
        ├── test_<model-name>.py
        ├── readme.md  # optional
        └── train_<model-name>.py  # optional
```
You can include arbitrary other supporting files like metadata, model features (below 10MB to keep `git clone` time low) if they are needed to run the model or might help others reproduce your results. For larger files, please upload to Figshare or similar and link them somewhere in your files.
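Before committing, a quick script can confirm the folder matches the expected layout. This is a hypothetical stdlib-only helper sketched for this purpose, not part of the repo's tooling:

```python
from pathlib import Path

# Hypothetical pre-PR check: confirm a models/<model-name> folder
# contains the three required files described above.
def check_model_dir(model_dir: Path) -> list[str]:
    """Return a list of problems found in a models/<model-name> folder."""
    problems = []
    if not (model_dir / "metadata.yml").is_file():
        problems.append("missing metadata.yml")
    # predictions file: <yyyy-mm-dd>-<model-name>-preds.(json|csv).gz
    if not any(model_dir.glob("*-preds.json.gz")) and not any(model_dir.glob("*-preds.csv.gz")):
        problems.append("missing <yyyy-mm-dd>-<model-name>-preds.(json|csv).gz")
    # test script: test_<model-name>.py or test_<model-name>.ipynb
    if not any(model_dir.glob("test_*.py")) and not any(model_dir.glob("test_*.ipynb")):
        problems.append("missing test_<model-name>.(py|ipynb)")
    return problems
```

An empty return value means the three required files are all present; optional files like `readme.md` are deliberately not checked.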
### Step 3: Create a PR to the [Matbench Discovery repo](https://github.com/janosh/matbench-discovery)
Commit your files to the repo on a branch called `<model-name>` and create a pull request (PR) to the Matbench Discovery repository.
```sh
git add models/<model-name>
git commit -m 'add <model-name> to Matbench Discovery leaderboard'
```
And you're done! Once tests pass and the PR is merged, your model will be added to the leaderboard! 🎉