Commit d491439

file tables and improved wording in data/wbm/readme.md
1 parent 6d80e5b commit d491439

3 files changed

+39
-41
lines changed

data/wbm/readme.md

+37-39
@@ -10,59 +10,57 @@ Since repeated substitutions should - on average - increase chemical dissimilari
1010

1111
## About the IDs
1212

13-
As you may have guessed, the first integer in each material ID following the prefix `wbm-` ranges from 1 to 5 and indicates the substitution iteration count. Each iteration has varying numbers of materials counted by the 2nd integer. Note the 2nd integer is not strictly consecutive. A small number of materials (~0.2%) were removed by the data processing steps detailed below. Don't be surprised to find an ID like `wbm-3-70804` followed by
13+
The first integer in each material ID ranging from 1 to 5 and coming right after the prefix `wbm-` indicates the substitution step, i.e. in which iteration of the substitution process was this material generated. Each iteration has varying numbers of materials which are counted by the 2nd integer. Note this 2nd number is not always consecutive. A small number of materials (~0.2%) were removed by the data processing steps detailed below. Don't be surprised to find an ID like `wbm-3-70804` followed by `wbm-3-70807`.
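As a minimal sketch of the ID scheme described in this paragraph, the snippet below splits an ID into its substitution step and per-step counter. The helper name `parse_wbm_id` and the regular expression are assumptions made for illustration; neither is taken from the repository.

```python
import re

# Hypothetical helper (not from the repository): matches IDs of the form
# wbm-<step>-<counter>, where <step> is a single digit from 1 to 5.
WBM_ID_PATTERN = re.compile(r"^wbm-([1-5])-(\d+)$")


def parse_wbm_id(material_id: str) -> tuple[int, int]:
    """Return (substitution step, counter within that step) for a WBM ID."""
    match = WBM_ID_PATTERN.match(material_id)
    if match is None:
        raise ValueError(f"not a WBM material ID: {material_id!r}")
    step, counter = match.groups()
    return int(step), int(counter)


print(parse_wbm_id("wbm-3-70804"))  # (3, 70804)
```

Because some materials were removed during processing, code that consumes these IDs should not assume the counter is gap-free within a step.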
1414

1515
## Data processing steps
1616

17-
The full set of processing steps used to curate the WBM test set from the raw data files (downloaded from the URLs listed below) can be found in [`data/wbm/fetch_process_wbm_dataset.py`](https://github.com/janosh/matbench-discovery/blob/site/data/wbm/fetch_process_wbm_dataset.py). Processing involved
17+
The full set of processing steps used to curate the WBM test set from the raw data files (downloaded from URLs listed below) can be found in [`data/wbm/fetch_process_wbm_dataset.py`](https://github.com/janosh/matbench-discovery/blob/site/data/wbm/fetch_process_wbm_dataset.py). Processing involved
1818

1919
- re-formatting material IDs
20-
- correctly aligning initial structures to DFT-relaxed `ComputedStructureEntries`
20+
- correctly aligning initial structures to DFT-relaxed [`ComputedStructureEntries`](https://pymatgen.org/pymatgen.entries.computed_entries.html#pymatgen.entries.computed_entries.ComputedStructureEntry)
2121
- remove 6 pathological structures (with 0 volume)
22-
- remove formation energy outliers below -5 and above 5 eV/atom (removed 502 and 22 crystals respectively out of 257,487 total, including an anomaly of 500 structures at exactly -10 eV/atom)
22+
- remove formation energy outliers below -5 and above 5 eV/atom (502 and 22 crystals respectively out of 257,487 total, including an anomaly of 500 structures at exactly -10 eV/atom)
2323
<!-- ![WBM formation energy histogram indicating outlier cutoffs](2022-12-07-hist-e-form-per-atom.png) -->
2424
- apply the [`MaterialsProject2020Compatibility`](https://pymatgen.org/pymatgen.entries.compatibility.html#pymatgen.entries.compatibility.MaterialsProject2020Compatibility) energy correction scheme to the formation energies
25-
- compute energy to the convex hull constructed from all MP `ComputedStructureEntries` queried on 2022-09-16 ([database release 2021.05.13](https://docs.materialsproject.org/changes/database-versions#v2021.05.13))
25+
- compute energy to the Materials Project convex hull constructed from all MP `ComputedStructureEntries` queried on 2022-09-16 ([database release 2021.05.13](https://docs.materialsproject.org/changes/database-versions#v2021.05.13))
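The two removal steps in the list above (zero-volume structures and formation energy outliers) could be sketched roughly as follows. This is an illustration only, not the logic of `fetch_process_wbm_dataset.py`: the record keys `volume` and `e_form_per_atom` and the example IDs are hypothetical, and only the -5 and 5 eV/atom cutoffs come from the text.

```python
# Hypothetical records for illustration. The keys "volume" and
# "e_form_per_atom" are assumed names, not the repository's column names.
records = [
    {"material_id": "example-a", "volume": 40.2, "e_form_per_atom": -1.3},
    {"material_id": "example-b", "volume": 0.0, "e_form_per_atom": -0.4},
    {"material_id": "example-c", "volume": 55.1, "e_form_per_atom": -10.0},
    {"material_id": "example-d", "volume": 61.7, "e_form_per_atom": 5.6},
]

# Cutoffs in eV/atom as quoted in the processing steps.
E_FORM_MIN, E_FORM_MAX = -5.0, 5.0


def keep(record: dict) -> bool:
    """Keep structures with non-zero volume and formation energy in range."""
    has_volume = record["volume"] > 0
    in_range = E_FORM_MIN <= record["e_form_per_atom"] <= E_FORM_MAX
    return has_volume and in_range


kept = [record for record in records if keep(record)]
print([record["material_id"] for record in kept])  # ['example-a']
```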
2626

27-
The number of materials in each step before and after processing are:
27+
Invoking the script `python fetch_process_wbm_dataset.py` will auto-download and regenerate the WBM test set files from scratch. If you find
28+
29+
- any questionable structures or data records in the released test set, or
30+
- inconsistencies between the files on GitHub vs the output of that script,
31+
32+
please [raise an issue](https://github.com/janosh/matbench-discovery/issues).
2833

29-
| step | 1 | 2 | 3 | 4 | 5 | total |
30-
| ---- | ------ | ------ | ------ | ------ | ------ | ------- |
31-
| pre | 61,848 | 52,800 | 79,205 | 40,328 | 23,308 | 257,487 |
32-
| post | 61,466 | 52,755 | 79,160 | 40,314 | 23,268 | 256,963 |
34+
The number of materials in each step before and after processing are:
3335

34-
Invoking that script with `python fetch_process_wbm_dataset.py` will auto-download and regenerate the WBM test set files from scratch. If you find any questionable in the released test set or inconsistencies between the files on GitHub vs the output of that script, please [raise an issue](https://github.com/janosh/matbench-discovery/issues).
36+
| step | 1 | 2 | 3 | 4 | 5 | total |
37+
| ------ | ------ | ------ | ------ | ------ | ------ | ------- |
38+
| before | 61,848 | 52,800 | 79,205 | 40,328 | 23,308 | 257,487 |
39+
| after | 61,466 | 52,755 | 79,160 | 40,314 | 23,268 | 256,963 |
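The totals in this table can be cross-checked against the outlier counts and the "~0.2%" figure quoted earlier in the README. A short sketch using only numbers stated in the document:

```python
# Totals from the "before" and "after" rows of the table.
total_before = 257_487
total_after = 256_963

# Outlier counts quoted in the processing steps (below -5 and above 5 eV/atom).
n_below, n_above = 502, 22

removed = total_before - total_after
print(removed)  # 524
print(removed == n_below + n_above)  # True
print(f"{removed / total_before:.2%}")  # 0.20%
```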
3540

36-
## Links to WBM data files
41+
## Links to raw WBM data files
3742

3843
Links to WBM data files have proliferated. This is an attempt to keep track of all of them.
3944

40-
Initial structures were sent as Google Drive links via email by Hai-Chen Wang on 2021-09-01.
41-
42-
step 1: <https://drive.google.com/file/d/1ZUgtYwrfZn_P8bULWRtTXepyAxHVxS5C>
43-
step 2: <https://drive.google.com/file/d/1-3uu2AcARJxH7GReteGVASZTuttFGiW_>
44-
step 3: <https://drive.google.com/file/d/1hc5BvDiFfTu_tc5F8m7ONSw2OgL9vN6o>
45-
step 4: <https://drive.google.com/file/d/1aMYxG5YJUgMHpbWmHpzL4hRfmP26UQqh>
46-
step 5: <https://drive.google.com/file/d/17kQt2r78ReWle4PhEIOXG7w7BFdezGM1>
47-
summary: <https://drive.google.com/file/d/1639IFUG7poaDE2uB6aISUOi65ooBwCIg>
48-
49-
The `ComputedStructureEntries` for steps 1-3 were also linked from the [WBM Nature paper][wbm paper]:
50-
51-
Index page: <https://tddft.org/bmg/data.php>
52-
step 1 CSEs: <https://tddft.org/bmg/files/data/substitutions_000.json.bz2>
53-
step 2 CSEs: <https://tddft.org/bmg/files/data/substitutions_001.json.bz2>
54-
step 3 CSEs: <https://tddft.org/bmg/files/data/substitutions_002.json.bz2>
55-
CIF files: <https://tddft.org/bmg/files/data/similarity-cifs.tar.gz>
56-
57-
Materials Cloud archive: <https://archive.materialscloud.org/record/2021.68>
58-
File URLs:
59-
60-
- readme: <https://archive.materialscloud.org/record/file?record_id=840&filename=README.txt>
61-
- summary: <https://archive.materialscloud.org/record/file?record_id=840&filename=summary.txt.bz2>
62-
- step 1: <https://archive.materialscloud.org/record/file?record_id=840&filename=step_1.json.bz2>
63-
- step 2: <https://archive.materialscloud.org/record/file?record_id=840&filename=step_2.json.bz2>
64-
- step 3: <https://archive.materialscloud.org/record/file?record_id=840&filename=step_3.json.bz2>
65-
- step 4: <https://archive.materialscloud.org/record/file?record_id=840&filename=step_4.json.bz2>
66-
- step 5: <https://archive.materialscloud.org/record/file?record_id=840&filename=step_5.json.bz2>
45+
Initial structures (after element substitution but before DFT relaxation) were sent as Google Drive links via email by Hai-Chen Wang on 2021-09-01.
46+
47+
### Google Drive
48+
49+
| Google Drive links | [step 1](https://drive.google.com/file/d/1ZUgtYwrfZn_P8bULWRtTXepyAxHVxS5C) | [step 2](https://drive.google.com/file/d/1-3uu2AcARJxH7GReteGVASZTuttFGiW_) | [step 3](https://drive.google.com/file/d/1hc5BvDiFfTu_tc5F8m7ONSw2OgL9vN6o) | [step 4](https://drive.google.com/file/d/1aMYxG5YJUgMHpbWmHpzL4hRfmP26UQqh) | [step 5](https://drive.google.com/file/d/17kQt2r78ReWle4PhEIOXG7w7BFdezGM1) | [summary](https://drive.google.com/file/d/1639IFUG7poaDE2uB6aISUOi65ooBwCIg) |
50+
| ------------------ | --------------------------------------------------------------------------- | --------------------------------------------------------------------------- | --------------------------------------------------------------------------- | --------------------------------------------------------------------------- | --------------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
51+
52+
### Halle University
53+
54+
The [paper itself][wbm paper] links to a [Halle University data page](https://tddft.org/bmg/data.php) which lists download URLs for CIF files and the `ComputedStructureEntries` (CSEs) of steps 1-3:
55+
56+
| [Halle University links](https://tddft.org/bmg/data.php) | [step 1 CSEs](https://tddft.org/bmg/files/data/substitutions_000.json.bz2) | [step 2 CSEs](https://tddft.org/bmg/files/data/substitutions_001.json.bz2) | [step 3 CSEs](https://tddft.org/bmg/files/data/substitutions_002.json.bz2) | [CIF files](https://tddft.org/bmg/files/data/similarity-cifs.tar.gz) |
57+
| -------------------------------------------------------- | -------------------------------------------------------------------------- | -------------------------------------------------------------------------- | -------------------------------------------------------------------------- | -------------------------------------------------------------------- |
58+
59+
### Materials Cloud
60+
61+
materialscloud:2021.68 includes a readme file with a description of the dataset, meanings of the summary CSV columns and a Python script for loading the data.
62+
63+
| [Materials Cloud archive](https://archive.materialscloud.org/record/2021.68) | [step 1](https://archive.materialscloud.org/record/file?record_id=840&filename=step_1.json.bz2) | [step 2](https://archive.materialscloud.org/record/file?record_id=840&filename=step_2.json.bz2) | [step 3](https://archive.materialscloud.org/record/file?record_id=840&filename=step_3.json.bz2) | [step 4](https://archive.materialscloud.org/record/file?record_id=840&filename=step_4.json.bz2) | [step 5](https://archive.materialscloud.org/record/file?record_id=840&filename=step_5.json.bz2) | [summary](https://archive.materialscloud.org/record/file?record_id=840&filename=summary.txt.bz2) | [readme](https://archive.materialscloud.org/record/file?record_id=840&filename=README.txt) |
64+
| ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------ |
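Once one of the `.json.bz2` files linked above has been downloaded, it could be read along these lines. This is a sketch that assumes, based only on the file extension, that the files are bz2-compressed JSON; the helper name `load_bz2_json` is hypothetical and the structure of the parsed data is not specified here.

```python
import bz2
import json


def load_bz2_json(path: str):
    """Decompress a .json.bz2 file and parse its contents as JSON."""
    # "rt" opens the compressed stream in text mode so json.load can read it.
    with bz2.open(path, mode="rt", encoding="utf-8") as file:
        return json.load(file)
```

For example, `load_bz2_json("step_1.json.bz2")` would return the parsed contents of a locally saved copy of the step 1 file.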
6765

6866
[wbm paper]: https://nature.com/articles/s41524-020-00481-6

readme.md

+1-1
@@ -13,7 +13,7 @@ Matbench Discovery
1313

1414
</h4>
1515

16-
Matbench is an [interactive leaderboard](https://matbench-discovery.janosh.dev/figures) and associated [PyPI package](https://pypi.org/project/matbench-discovery) for benchmarking ML energy models on a task designed to closely emulate a real-world computational materials discovery workflow in which these models would be used for a pre-triaging step to determine how to allocate limited compute budget on DFT structure relaxations.
16+
Matbench Discovery is an [interactive leaderboard](https://matbench-discovery.janosh.dev/figures) and associated [PyPI package](https://pypi.org/project/matbench-discovery) for benchmarking ML energy models on a task designed to closely emulate a real-world computational materials discovery workflow in which these models would be used for a pre-triaging step to determine how to allocate limited compute budget on DFT structure relaxations.
1717

1818
We welcome contributions that add new models to the leaderboard through [GitHub PRs](https://github.com/janosh/matbench-discovery/pulls).
1919

site/package.json

+1-1
@@ -13,7 +13,7 @@
1313
"preview": "vite preview",
1414
"serve": "vite build && vite preview",
1515
"check": "svelte-check",
16-
"make-api-docs": "cd .. && lazydocs matbench_discovery --output-path site/src/routes/api --no-watermark --src-base-url https://github.com/janosh/matbench-discovery/blob/main"
16+
"make-api-docs": "rm -f src/routes/api/*.md && cd .. && lazydocs matbench_discovery --output-path site/src/routes/api --no-watermark --src-base-url https://github.com/janosh/matbench-discovery/blob/main"
1717
},
1818
"devDependencies": {
1919
"@iconify/svelte": "^3.0.1",
