# Voronoi Random Forest
## Model Architecture
Voronoi tessellation with `matminer` featurization piped into `scikit-learn`'s [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).
## Reasoning
The idea behind this combination of features and model was to have an easy-to-implement baseline algorithm. It's a bit dated in that it uses handcrafted Magpie features (which have been shown to underperform learned features on datasets exceeding ~10^4 samples), but not so weak as to be indefensible. Since its Voronoi-tessellation-based featurization is invariant to crystal structure relaxation, it is a natural choice for predicting the stability of unrelaxed crystals.
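The model half of this pipeline is just an off-the-shelf random forest. A minimal sketch, assuming the Voronoi/Magpie features have already been computed by `matminer` into a plain feature matrix (the random data below is a stand-in for that featurized output, not real structures):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in for a matminer feature matrix: rows = structures, columns = features.
# In the real pipeline these would come from the Voronoi/Magpie featurizers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))  # hypothetical featurized structures
y = rng.normal(size=200)  # hypothetical target energies (eV/atom)

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X, y)
preds = model.predict(X)  # one energy prediction per structure
```

Hyperparameter values here are illustrative defaults, not the tuned settings used for the benchmark run.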
## OOM errors during featurization
There was an obstacle that made this model more difficult to train and test than anticipated: `matminer` uses `multiprocessing`, which seems to be the cause of out-of-memory errors on large structures. Initially, `MultipleFeaturizer` wouldn't run without crashing, even when running on small subsets of the data (1%) and setting the `sbatch` flag `--mem 100G`:
```
slurmstepd: error: Detected 52 oom-kill event(s) in StepId=7401930.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
Saving tip came from [Alex Dunn via Slack](https://berkeleytheory.slack.com/archives/D03ULSTNRMX/p1668746161675349): try `featurizer.set_n_jobs(1)`.
## Archive
Files in `2022-10-04-rhys-voronoi.zip` received from Rhys via [Slack](https://ml-physics.slack.com/archives/DD8GBBRLN/p1664929946687049). They are unchanged originals.
File: `site/src/routes/how-to-contribute/+page.md`
## Installation
The recommended way to acquire the train and test data for this benchmark is through its Python package [available on PyPI](https://pypi.org/project/matbench-discovery):
```zsh
pip install matbench-discovery
```

Here's an example script of how to download the training and test set files:
```py notest
from matbench_discovery.data import load_train_test
from matbench_discovery.data import df_wbm, DATA_FILENAMES
# any subset of these keys can be passed to load_train_test()
```
1. `formula`: A compound's unreduced alphabetical formula
1. `n_sites`: Number of sites in the structure's unit cell
## Direct Download
You can also download the data files directly from GitHub:
1. [`2022-10-19-wbm-summary.csv`](https://github.com/janosh/matbench-discovery/raw/v1.0.0/data/wbm/2022-10-19-wbm-summary.csv) [[GitHub](https://github.com/janosh/matbench-discovery/blob/v1/data/wbm/2022-10-19-wbm-summary.csv)]: Computed material properties only, no structures. Available properties are VASP energy, formation energy, energy above the convex hull, volume, band gap, number of sites per unit cell, and more. `e_form_per_atom` and `e_above_hull` each have 3 separate columns for old, new and no Materials Project energy corrections.
1. [`2022-09-19-mp-elemental-ref-energies.json`](https://github.com/janosh/matbench-discovery/raw/v1.0.0/data/wbm/2022-09-19-mp-elemental-ref-energies.json) [[GitHub](https://github.com/janosh/matbench-discovery/blob/v1/data/wbm/2022-09-19-mp-elemental-ref-energies.json)]: Minimum energy `PDEntries` for each element present in the Materials Project
To add a new model to this benchmark, please create a pull request to the `main` branch of <https://github.com/janosh/matbench-discovery> that includes at least these 3 required files:
1. `<yyyy-mm-dd>-<model-name>-preds.(json|csv).gz`: Your model's energy predictions for all ~250k WBM compounds as compressed JSON or CSV. The recommended way to create this file is with `pandas.DataFrame.to_{json|csv}('<yyyy-mm-dd>-<model-name>-preds.(json|csv).gz')`. JSON is preferred over CSV if your model predicts not only energies (floats) but also Python objects, e.g. pseudo-relaxed structures (see the M3GNet and BOWSR test scripts).
1. `test_<model-name>.(py|ipynb)`: The Python script or Jupyter notebook used to generate the energy predictions. Ideally, this file should have comments explaining at a high level what the code is doing and how the model works so others can understand and reproduce your results. If the model deployed on this benchmark was trained specifically for this purpose (i.e. if you wrote any training/fine-tuning code while preparing your PR), please also include it as `train_<model-name>.(py|ipynb)`.
1. `metadata.yml`: A file to record all relevant metadata about your algorithm, like model name, authors (can differ between the model and the PR), package requirements, relevant citations/links to publications and other info about the model. Here's a template:

   ```yml
   model_name: My Model # required
   authors: # required
     - name: Jane Doe
   repo: https://github.com/<user>/<repo> # required
   version: 1.0.0 # required
   notes: |
     Optional free form multi-line notes that might help others reproduce your results.
   ```
Only the keys `model_name`, `authors`, `repo`, `version` are required. Arbitrary other keys can be added as needed.
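Since only those four keys are required, a quick sanity check before opening a PR can be as simple as the following. This is a hypothetical helper, not part of the `matbench-discovery` package; the example mapping stands in for the output of a YAML parser like `yaml.safe_load`:

```python
# Hypothetical pre-submission check: verify a parsed metadata.yml mapping
# defines the four required keys.
REQUIRED_KEYS = {"model_name", "authors", "repo", "version"}

def missing_metadata_keys(metadata: dict) -> list[str]:
    """Return required keys absent from a parsed metadata.yml mapping."""
    return sorted(REQUIRED_KEYS - metadata.keys())

# Example: a minimal valid metadata mapping (illustrative values)
meta = {
    "model_name": "Voronoi Random Forest",
    "authors": [{"name": "Jane Doe"}],
    "repo": "https://github.com/janosh/matbench-discovery",
    "version": "1.0.0",
}
missing = missing_metadata_keys(meta)  # → [] for a valid mapping
```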
Please see any of the subdirectories in [`models/`](https://github.com/janosh/matbench-discovery/tree/main/models) for example submissions. More detailed step-by-step instructions below:
### Step 2: Commit model preds, script and metadata
Create a new folder
```sh
mkdir models/<model-name>
```
and place the above listed files there. The file structure should look like this:
```txt
matbench-discovery-root
└── models
    └── <model name>
        ├── metadata.yml
        ├── <yyyy-mm-dd>-<model-name>-preds.(json|csv).gz
        ├── test_<model-name>.py
        ├── readme.md  # optional
        └── train_<model-name>.py  # optional
```
You can include arbitrary other supporting files like metadata, model features (below 10MB to keep `git clone` time low) if they are needed to run the model or might help others reproduce your results. For larger files, please upload to Figshare or similar and link them somewhere in your files.
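Before committing, a quick script can confirm the folder matches the expected layout. This is a hypothetical stdlib-only helper sketched for this purpose, not part of the repo's tooling:

```python
from pathlib import Path

# Hypothetical pre-PR check: confirm a models/<model-name> folder
# contains the three required files described above.
def check_model_dir(model_dir: Path) -> list[str]:
    """Return a list of problems found in a models/<model-name> folder."""
    problems = []
    if not (model_dir / "metadata.yml").is_file():
        problems.append("missing metadata.yml")
    # predictions file: <yyyy-mm-dd>-<model-name>-preds.(json|csv).gz
    if not any(model_dir.glob("*-preds.json.gz")) and not any(model_dir.glob("*-preds.csv.gz")):
        problems.append("missing <yyyy-mm-dd>-<model-name>-preds.(json|csv).gz")
    # test script: test_<model-name>.py or test_<model-name>.ipynb
    if not any(model_dir.glob("test_*.py")) and not any(model_dir.glob("test_*.ipynb")):
        problems.append("missing test_<model-name>.(py|ipynb)")
    return problems
```

An empty return value means the three required files are all present; optional files like `readme.md` are deliberately not checked.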
### Step 3: Create a PR to the [Matbench Discovery repo](https://github.com/janosh/matbench-discovery)
Commit your files to the repo on a branch called `<model-name>` and create a pull request (PR) to the Matbench Discovery repository.
```sh
git add models/<model-name>
git commit -m 'add <model-name> to Matbench Discovery leaderboard'
```
And you're done! Once tests pass and the PR is merged, your model will be added to the leaderboard! 🎉