Commit 85892aa

Migrate MLFF geometry optimization files to JSON Lines format for fast partial loading (#231)
* fix: changed RMSD units from Ångström to unitless across all model YAML files
* fix `calc_geo_opt_metrics` to handle NaN values appropriately by filling them with 1.0
  - scripts/evals/geo_opt.py add note explaining RMSD values are unitless
* rename `pred_vs_ref_struct_symmetry` to `calc_structure_distances`
  - after moving spacegroup comparison out of pred_vs_ref_struct_symmetry to analyze_geo_opt.py
  - calc_structure_distances now only calculates distance metrics between predicted and reference structures
  - fix tests to cover new functionality and ensure robustness against mismatched IDs
* analyze_geo_opt.py add CLI flag `analysis_type`: 'all', 'symmetry', or 'distance'
  - Conditional execution of symmetry analysis and structure distance calculations based on the specified analysis type
* convert all model geo_opt pred files from .json.gz to .jsonl.gz for faster loading on debug runs
  - uploaded to new figshare article: https://figshare.com/articles/dataset/28642406
  - old one is now deprecated: https://figshare.com/articles/dataset/28187999
  - symmetry and distance analysis files to be added to new article next
* add read-only mode to `update_yaml_at_path` by passing data=None
  - update_yaml_at_path now allows reading from a YAML file when `data` is set to None, returning the value at the specified dotted path without modifying the file
  - tests to verify read-only functionality
* `write_metrics_to_yaml` now accepts metrics as either a DataFrame or a dictionary
  - Refined `calc_geo_opt_metrics` to handle NaN values
  - more test coverage for both functions
* update article ID for `model_preds_geo_opt` in `figshare.py` `ARTICLE_IDS`
  - Modified `calc_structure_distances` in `symmetry.py` to print a warning instead of raising an error when no shared IDs between predicted and reference structures
  - new test cases in `test_symmetry.py` to verify the new warning behavior and ensure proper handling of NaN values in distance calculations
* add tests/metrics/test_analyze_geo_opt.py unit tests for geometry optimization analysis in `test_analyze_geo_opt.py`
  - bump pre-commit hooks for ruff, eslint, and pyright
* removed n_structures field from YAML files since already included in analysis filename
  - updated analysis file paths to include structure counts in filenames for consistency
  - modified RMSD values to reflect updated config in several models
* change all WBM initial/relaxed structure `pd.read_json` calls to use `lines=True` for new JSON files in line-delimited format
  - Updated relevant scripts and models to ensure compatibility
* re-update WBM computed structure entries and initial structures and update figshare URLs in data-files.yml
  - remove DataFiles.wbm_cses_plus_init_structs altogether, usually you just need one or the other, not both initial and relaxed structures
  - change all references from `wbm_cses_plus_init_structs` to `wbm_initial_structures` and `wbm_computed_structure_entries` in scripts and models
  - enhance upload script with argparse for file selection
* rename all `df_cse` variables to `df_wbm_cse` or `df_mp_cse` for readability
* specify JSON Lines format for model-relaxed structures in contributing.md and PR template.md
  - update data-files.yml to reflect changes in file paths from .json.bz2 to .jsonl.gz for WBM computed and initial structures, removal of wbm_cses_plus_init_structs
* temp revert to previous metrics.geo_opt format for now
* fix pytest
* revert half-baked update_yaml_at_path() and write_metrics_to_yaml() changes for now
* add models/mattersim/extract_final_structs_from_relax_traj_take2.py
  - The original trajectories can be found at: https://figshare.com/s/a629acbf3bed6a04b3ce?file=53060504
* migrate model scripts for geo_opt test or pred joining to write relaxed structures in JSON Lines format
* revert code changes in scripts/analyze_geo_opt.py + matbench_discovery/structure/symmetry.py + matbench_discovery/data.py + tests/structure/test_symmetry.py
  - keep only new paragraph in module doc str
* reapply minimal RMSD fixes as proposed in #230
1 parent c41df43 commit 85892aa

88 files changed (+394, -296 lines changed)

.github/pull_request_template.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Please check the following items before submitting your PR:
 - [ ] I have created a new folder and YAML metadata file `models/<arch_name>/<model_variant>.yml` for my submission. `arch_name` is the name of the architecture and `model_variant.yml` includes things like author details, training set names and important hyperparameters.
 - [ ] I have added the my new model as a new attribute on the [`Model.<arch_name>` enum](https://github.com/janosh/matbench-discovery/blob/57d0d0c8a14cd317/matbench_discovery/enums.py#L274) in `enums.py`.
 - [ ] I have uploaded the energy/force/stress model prediction file for the WBM test set to Figshare or another cloud storage service (`<yyyy-mm-dd>-<model_variant>-preds.csv.gz`).
-- [ ] I have uploaded the model-relaxed structures file to Figshare or another cloud storage service (`<yyyy-mm-dd>-wbm-IS2RE-FIRE.json.gz`).
+- [ ] I have uploaded the model-relaxed structures file to Figshare or another cloud storage service in [JSON lines format](https://jsonlines.org) (`<yyyy-mm-dd>-wbm-IS2RE-FIRE.jsonl.gz`). JSON Lines allows fast loading of small numbers of structures with `pandas.read_json(lines=True, nrows=100)` for inspection.
 - [ ] I have uploaded the phonon predictions to Figshare or another cloud storage service (`<yyyy-mm-dd>-kappa-103-FIRE-<values-of-dist|fmax|symprec>.gz`).
 - [ ] I have included the urls to the Figshare files in the YAML metadata file (`models/<arch_name>/<model_variant>.yml`). If not using Figshare I have included the urls to the cloud storage service in the description of the PR.
 - [ ] I have included the test script (`test_<arch_name>_<task>.py` for `task` in `discovery`, `kappa`, `diatomics`) that generated the prediction files.
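
Note on the JSON Lines checklist item: the partial loading it mentions can be sketched as follows. This is a minimal illustration and not code from the commit; the file name, column names and values are made up, and only `to_json(orient="records", lines=True)` and `read_json(lines=True, nrows=...)` are real pandas calls.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a model-relaxed structures file (values are made up).
df_full = pd.DataFrame(
    {
        "material_id": [f"wbm-1-{idx}" for idx in range(1, 6)],
        "energy": [-1.0, -2.0, -3.0, -4.0, -5.0],
    }
)

tmp_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(tmp_dir, "example-wbm-IS2RE-FIRE.jsonl.gz")

# One JSON record per line; gzip compression is inferred from the .gz suffix.
df_full.to_json(jsonl_path, orient="records", lines=True)

# nrows is only accepted together with lines=True; only the first 2 records are parsed.
df_head = pd.read_json(jsonl_path, lines=True, nrows=2)
print(df_head)
```

With a single large JSON document, by contrast, the whole file has to be parsed before any row is available, which is presumably why the commit moves the structure files to one record per line.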

.pre-commit-config.yaml

Lines changed: 3 additions & 3 deletions
@@ -8,7 +8,7 @@ default_install_hook_types: [pre-commit, commit-msg]

 repos:
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.11.0
+    rev: v0.11.2
     hooks:
       - id: ruff
         args: [--fix]
@@ -57,7 +57,7 @@ repos:
         exclude: ^(site/src/figs/.+\.svelte|data/wbm/20.+\..+|site/src/(routes|figs).+\.(yaml|json)|changelog.md)$

   - repo: https://github.com/pre-commit/mirrors-eslint
-    rev: v9.22.0
+    rev: v9.23.0
     hooks:
       - id: eslint
         types: [file]
@@ -84,7 +84,7 @@ repos:
       - id: check-github-actions

   - repo: https://github.com/RobertCraigie/pyright-python
-    rev: v1.1.396
+    rev: v1.1.397
     hooks:
       - id: pyright
         args: [--level, error]

contributing.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ To submit a new model to this benchmark and add it to our leaderboard, please cr

 1. You should share your model's predictions through a cloud storage service (we recommend [Figshare](https://figshare.com)) and include the download links in your PR description. Your cloud storage directory should contain files in a compressed format with the following naming convention: `<arch-name>/<model-variant>/<yyyy-mm-dd>-<eval-task>.{csv.gz|json.gz}`. For example, a in the case of MACE-MP-0, the file paths would be:

-   - geometry optimization: `mace/mace-mp-0/2023-12-11-wbm-IS2RE-FIRE.json.gz`
+   - geometry optimization: `mace/mace-mp-0/2023-12-11-wbm-IS2RE-FIRE.jsonl.gz` (use [JSON Lines format](https://jsonlines.org) for fast loading of small numbers of structures with `pandas.read_json(lines=True, nrows=100)` for inspection)
    - discovery: `mace/mace-mp-0/2023-12-11-wbm-IS2RE.csv.gz`
    - phonons: `mace/mace-mp-0/2024-11-09-kappa-103-FIRE-dist=0.01-fmax=1e-4-symprec=1e-5.json.gz`

data/mp/build_phase_diagram.py

Lines changed: 4 additions & 2 deletions
@@ -33,8 +33,10 @@
 df_mp_cse.index.name = Key.mat_id
 df_mp_cse.index = [e.entry_id for e in df_mp_cse.entry]
 df_mp_cse.reset_index().to_json(
-    f"{module_dir}/{today}-mp-computed-structure-entries.json.gz",
+    f"{module_dir}/{today}-mp-computed-structure-entries.jsonl.gz",
     default_handler=lambda x: x.as_dict(),
+    orient="records",
+    lines=True,
 )


@@ -74,7 +76,7 @@

 # %% build phase diagram with both MP entries + WBM entries
 wbm_cse_path = DataFiles.wbm_computed_structure_entries.path
-df_wbm = pd.read_json(wbm_cse_path).set_index(Key.mat_id)
+df_wbm = pd.read_json(wbm_cse_path, lines=True).set_index(Key.mat_id)

 # using ComputedStructureEntry vs ComputedEntry here is important as CSEs receive
 # more accurate energy corrections that take into account peroxide/superoxide nature
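
Note on the `to_json` change: the hunk adds `orient="records"` together with `lines=True`. The sketch below illustrates why the two go together; the frame is hypothetical and its nested dicts only mimic serialized entries.

```python
import json

import pandas as pd

# Hypothetical two-row frame; the nested dicts stand in for serialized entries.
df_entries = pd.DataFrame(
    {
        "material_id": ["mp-1", "mp-2"],
        "entry": [{"energy": -1.5}, {"energy": -2.5}],
    }
)

# Without a path, to_json returns the serialized text instead of writing a file.
jsonl_text = df_entries.to_json(orient="records", lines=True)
records = [json.loads(line) for line in jsonl_text.splitlines()]
print(records)

# pandas rejects lines=True for any orient other than "records".
try:
    df_entries.to_json(orient="columns", lines=True)
    raised = False
except ValueError:
    raised = True
print(f"{raised=}")
```

Each line of `jsonl_text` is a complete JSON object, so a reader can stop after any line without parsing the rest of the file.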

data/mp/get_mp_energies.py

Lines changed: 7 additions & 7 deletions
@@ -60,21 +60,21 @@


 # %%
-df_cse = pd.read_json(DataFiles.mp_computed_structure_entries.path).set_index(
+df_mp_cse = pd.read_json(DataFiles.mp_computed_structure_entries.path).set_index(
     Key.mat_id
 )

-df_cse[Key.structure] = [
+df_mp_cse[Key.structure] = [
     Structure.from_dict(cse[Key.structure])
-    for cse in tqdm(df_cse.entry, desc="Hydrating structures")
+    for cse in tqdm(df_mp_cse.entry, desc="Hydrating structures")
 ]
-df_cse[f"{Key.protostructure}_moyo"] = [
+df_mp_cse[f"{Key.protostructure}_moyo"] = [
     prototype.get_protostructure_label(struct)
-    for struct in tqdm(df_cse.structure, desc="Calculating proto-structure labels")
+    for struct in tqdm(df_mp_cse.structure, desc="Calculating proto-structure labels")
 ]
 # make sure symmetry detection succeeded for all structures
-assert df_cse[f"{Key.protostructure}_moyo"].str.startswith("invalid").sum() == 0
-df_mp[f"{Key.protostructure}_moyo"] = df_cse[f"{Key.protostructure}_moyo"]
+assert df_mp_cse[f"{Key.protostructure}_moyo"].str.startswith("invalid").sum() == 0
+df_mp[f"{Key.protostructure}_moyo"] = df_mp_cse[f"{Key.protostructure}_moyo"]

 spg_nums = df_mp[f"{Key.protostructure}_moyo"].str.split("_").str[2].astype(int)
 # make sure all our spacegroup numbers match MP's

data/pmg_structs_to_ase_extxyz.py

Lines changed: 4 additions & 2 deletions
@@ -53,7 +53,9 @@

 # %% convert WBM initial structures to ASE Atoms (no properties other than material ID
 # included in Atoms.info)
-df_wbm_init = pd.read_json(DataFiles.wbm_initial_structures.path).set_index(Key.mat_id)
+df_wbm_init = pd.read_json(DataFiles.wbm_initial_structures.path, lines=True).set_index(
+    Key.mat_id
+)

 wbm_init_atoms_list: list[Atoms] = []
 for mat_id, struct_dict in tqdm(df_wbm_init[Key.init_struct].items(), desc="WBM init"):
@@ -68,7 +70,7 @@
 # %% convert WBM ComputedStructureEntries to ASE Atoms (material ID and energy included
 # in Atoms.info)
 wbm_cse_path = DataFiles.wbm_computed_structure_entries.path
-df_wbm_cse = pd.read_json(wbm_cse_path).set_index(Key.mat_id)
+df_wbm_cse = pd.read_json(wbm_cse_path, lines=True).set_index(Key.mat_id)

 wbm_cse_atoms_list: list[Atoms] = []
 for mat_id, cse_dict in tqdm(

data/wbm/compare_cse_vs_ce_mp_2020_corrections.py

Lines changed: 7 additions & 7 deletions
@@ -23,29 +23,29 @@
 from matbench_discovery.enums import DataFiles

 wbm_cse_path = DataFiles.wbm_computed_structure_entries.path
-df_cse = pd.read_json(wbm_cse_path).set_index(Key.mat_id)
+df_wbm_cse = pd.read_json(wbm_cse_path, lines=True).set_index(Key.mat_id)

 cses = [
     ComputedStructureEntry.from_dict(dct)
     for dct in tqdm(
-        df_cse[Key.computed_structure_entry],
+        df_wbm_cse[Key.computed_structure_entry],
         desc="Loading ComputedStructureEntries",
     )
 ]

 ces = [
     ComputedEntry.from_dict(dct)
     for dct in tqdm(
-        df_cse[Key.computed_structure_entry], desc="Loading ComputedEntries"
+        df_wbm_cse[Key.computed_structure_entry], desc="Loading ComputedEntries"
     )
 ]


 # %%
 processed = MaterialsProject2020Compatibility().process_entries(cses, verbose=True)
-assert len(processed) == len(df_cse)
+assert len(processed) == len(df_wbm_cse)
 processed = MaterialsProject2020Compatibility().process_entries(ces, verbose=True)
-assert len(processed) == len(df_cse)
+assert len(processed) == len(df_wbm_cse)

 df_wbm["e_form_per_atom_mp2020_from_ce"] = [
     get_e_form_per_atom(entry)
@@ -66,9 +66,9 @@

 # %%
 processed = MaterialsProjectCompatibility().process_entries(cses, verbose=True)
-assert len(processed) == len(df_cse)
+assert len(processed) == len(df_wbm_cse)
 processed = MaterialsProjectCompatibility().process_entries(ces, verbose=True)
-assert len(processed) == len(df_cse)
+assert len(processed) == len(df_wbm_cse)

 df_wbm["e_form_per_atom_legacy_from_ce"] = [
     get_e_form_per_atom(entry) for entry in tqdm(ces)

data/wbm/compile_wbm_test_set.py

Lines changed: 0 additions & 13 deletions
@@ -703,16 +703,3 @@ def fix_bad_struct_index_mismatch(material_id: str) -> str:

 # %% write final summary data to disk (yeah!)
 df_summary.round(6).to_csv(f"{WBM_DIR}/{today}-wbm-summary.csv.gz")
-
-
-# %% only here to load data for later inspection
-if False:
-    df_summary = pd.read_csv(DataFiles.wbm_summary.path).set_index(Key.mat_id)
-    df_wbm = pd.read_json(DataFiles.wbm_cses_plus_init_structs.path).set_index(
-        Key.mat_id
-    )
-
-    df_wbm[Key.computed_structure_entry] = [
-        ComputedStructureEntry.from_dict(dct)
-        for dct in tqdm(df_wbm[Key.computed_structure_entry])
-    ]

data/wbm/eda_wbm.py

Lines changed: 11 additions & 6 deletions
@@ -359,15 +359,20 @@


 # %%
-df_wbm_structs = pd.read_json(DataFiles.wbm_cses_plus_init_structs.path)
-df_wbm_structs = df_wbm_structs.set_index(Key.mat_id)
+df_wbm_init_structs = pd.read_json(DataFiles.wbm_initial_structures.path, lines=True)
+df_wbm_init_structs = df_wbm_init_structs.set_index(Key.mat_id)
+
+df_wbm_final_structs = pd.read_json(
+    DataFiles.wbm_computed_structure_entries.path, lines=True
+)
+df_wbm_final_structs = df_wbm_final_structs.set_index(Key.mat_id)


 # %%
 for wbm_id in df_sym_change.index:
-    init_struct = Structure.from_dict(df_wbm_structs.loc[wbm_id][Key.init_struct])
+    init_struct = Structure.from_dict(df_wbm_init_structs.loc[wbm_id][Key.init_struct])
     final_struct = Structure.from_dict(
-        df_wbm_structs.loc[wbm_id][Key.computed_structure_entry]["structure"]
+        df_wbm_final_structs.loc[wbm_id][Key.computed_structure_entry]["structure"]
     )
     init_struct.properties[Key.mat_id] = f"{wbm_id}-init"
     final_struct.properties[Key.mat_id] = f"{wbm_id}-final"
@@ -379,11 +384,11 @@
 wbm_id = df_sym_change.index[0]

 struct = Structure.from_dict(
-    df_wbm_structs.loc[wbm_id][Key.computed_structure_entry]["structure"]
+    df_wbm_final_structs.loc[wbm_id][Key.computed_structure_entry]["structure"]
 )
 struct.to(f"{module_dir}/{wbm_id}.cif")
 struct.to(f"{module_dir}/{wbm_id}.json")

-struct = Structure.from_dict(df_wbm_structs.loc[wbm_id][Key.init_struct])
+struct = Structure.from_dict(df_wbm_init_structs.loc[wbm_id][Key.init_struct])
 struct.to(f"{module_dir}/{wbm_id}-init.cif")
 struct.to(f"{module_dir}/{wbm_id}-init.json")

matbench_discovery/data-files.yml

Lines changed: 4 additions & 10 deletions
@@ -47,8 +47,8 @@ mp_trj_extxyz:
   md5: 7f433171e4e5f2ef9304dccd42d5488f

 wbm_computed_structure_entries:
-  url: https://figshare.com/files/40344463
-  path: wbm/2022-10-19-wbm-computed-structure-entries.json.bz2
+  url: https://figshare.com/files/53161832
+  path: wbm/2022-10-19-wbm-computed-structure-entries.jsonl.gz
   description: JSON-Serialized `pymatgen` [`ComputedStructureEntries`] containing all WBM DFT-relaxed structures and corresponding final energies
   md5: 481959b65f28150ae6ee7297ddeba538

@@ -59,8 +59,8 @@ wbm_relaxed_atoms:
   md5: 4726643ac0dfbab69a4284454c891e68

 wbm_initial_structures:
-  url: https://figshare.com/files/40344466
-  path: wbm/2022-10-19-wbm-init-structs.json.bz2
+  url: https://figshare.com/files/53161835
+  path: wbm/2022-10-19-wbm-init-structs.jsonl.gz
   description: Unrelaxed WBM structures in `pymatgen` `Structure` format
   md5: ff2c40a3a7bf65468852b67f0dbc67df

@@ -70,12 +70,6 @@ wbm_initial_atoms:
   description: Unrelaxed WBM structures as `ase` Atoms in extended XYZ format
   md5: 2a292211ca6acb30ed8416178d644098

-wbm_cses_plus_init_structs:
-  url: https://figshare.com/files/40344469
-  path: wbm/2022-10-19-wbm-computed-structure-entries+init-structs.json.bz2
-  description: Both unrelaxed and DFT-relaxed WBM structures, the latter stored with their final VASP energies as `pymatgen` [`ComputedStructureEntries`]
-  md5: eaabe984d070156cc50a8a075cd5e315
-
 wbm_summary:
   url: https://figshare.com/files/44225498
   path: wbm/2023-12-13-wbm-summary.csv.gz

matbench_discovery/data.py

Lines changed: 1 addition & 1 deletion
@@ -333,7 +333,7 @@ def update_yaml_at_path(
     """
     # raise on repeated or trailing dots in dotted path
     if not re.match(r"^[a-zA-Z0-9-+=_]+(\.[a-zA-Z0-9-+=_]+)*$", dotted_path):
-        raise ValueError(f"Invalid dotted path: {dotted_path}")
+        raise ValueError(f"Invalid {dotted_path=}")

     with open(file_path) as file:
         yaml_data = round_trip_yaml.load(file)
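
Note on the dotted-path check: the sketch below shows which inputs the regex from this hunk accepts and what the new f-string message looks like. The pattern is copied from the hunk; the helper name `is_valid_dotted_path` and the example paths are made up for illustration and are not part of `matbench_discovery`.

```python
import re

# Pattern copied verbatim from the hunk above.
DOTTED_PATH_PATTERN = r"^[a-zA-Z0-9-+=_]+(\.[a-zA-Z0-9-+=_]+)*$"


def is_valid_dotted_path(dotted_path: str) -> bool:
    """Hypothetical helper: True if the path has no empty, leading or trailing segment."""
    return re.match(DOTTED_PATH_PATTERN, dotted_path) is not None


examples = [
    "metrics.geo_opt",  # plain nested key
    "metrics.geo_opt.symprec=1e-5",  # '=', '-' and digits are allowed inside a segment
    "metrics..geo_opt",  # repeated dot
    "metrics.",  # trailing dot
    ".metrics",  # leading dot
]
results = {path: is_valid_dotted_path(path) for path in examples}
print(results)

# The '=' specifier in an f-string prints the variable name and the repr of its value.
dotted_path = "metrics."
message = f"Invalid {dotted_path=}"
print(message)  # Invalid dotted_path='metrics.'
```

Because the `=` form includes the repr, stray whitespace or an empty string in the offending path shows up in the error message.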

matbench_discovery/enums.py

Lines changed: 5 additions & 9 deletions
@@ -458,7 +458,7 @@ class DataFiles(Files):

     mp_computed_structure_entries = (
         auto(),
-        ("mp/2023-02-07-mp-computed-structure-entries.json.gz"),
+        "mp/2023-02-07-mp-computed-structure-entries.json.gz",
     )
     mp_elemental_ref_entries = (
         auto(),
@@ -475,24 +475,20 @@ class DataFiles(Files):

     wbm_computed_structure_entries = (
         auto(),
-        ("wbm/2022-10-19-wbm-computed-structure-entries.json.bz2"),
+        "wbm/2022-10-19-wbm-computed-structure-entries.jsonl.gz",
     )
     wbm_relaxed_atoms = auto(), "wbm/2024-08-04-wbm-relaxed-atoms.extxyz.zip"
-    wbm_initial_structures = auto(), "wbm/2022-10-19-wbm-init-structs.json.bz2"
+    wbm_initial_structures = auto(), "wbm/2022-10-19-wbm-init-structs.jsonl.gz"
     wbm_initial_atoms = auto(), "wbm/2024-08-04-wbm-initial-atoms.extxyz.zip"
-    wbm_cses_plus_init_structs = (
-        auto(),
-        ("wbm/2022-10-19-wbm-computed-structure-entries+init-structs.json.bz2"),
-    )
     wbm_summary = auto(), "wbm/2023-12-13-wbm-summary.csv.gz"
     alignn_checkpoint = auto(), "2023-06-02-pbenner-best-alignn-model.pth.zip"
     phonondb_pbe_103_structures = (
         auto(),
-        ("phonons/2024-11-09-phononDB-PBE-103-structures.extxyz"),
+        "phonons/2024-11-09-phononDB-PBE-103-structures.extxyz",
     )
     phonondb_pbe_103_kappa_no_nac = (
         auto(),
-        ("phonons/2024-11-09-kappas-phononDB-PBE-noNAC.json.gz"),
+        "phonons/2024-11-09-kappas-phononDB-PBE-noNAC.json.gz",
     )
     wbm_dft_geo_opt_symprec_1e_2 = (
         auto(),
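
Note on the dropped parentheses: the commit message does not explain this cleanup, so the sketch below rests on the assumption that it is purely cosmetic. It checks that parentheses around a single string, with no trailing comma, leave the value a plain `str`. The variable names are illustrative only.

```python
# The parenthesized form removed by the diff and the bare form that replaces it.
path_with_parens = ("wbm/2022-10-19-wbm-init-structs.jsonl.gz")
path_without_parens = "wbm/2022-10-19-wbm-init-structs.jsonl.gz"

# Both are plain strings and compare equal.
same_value = path_with_parens == path_without_parens
both_are_str = isinstance(path_with_parens, str) and isinstance(path_without_parens, str)

# Only a trailing comma turns the parenthesized expression into a 1-tuple.
one_tuple = ("wbm/2022-10-19-wbm-init-structs.jsonl.gz",)

print(same_value, both_are_str, type(one_tuple).__name__)
```

Under that assumption the enum members keep the same `(auto(), path)` value before and after the change.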

matbench_discovery/metrics/geo_opt.py

Lines changed: 3 additions & 3 deletions
@@ -87,13 +87,13 @@ def calc_geo_opt_metrics(df_model_analysis: pd.DataFrame) -> dict[str, float]:
     # Get relevant columns
     spg_diff = df_model_analysis[MbdKey.spg_num_diff]
     n_sym_ops_diff = df_model_analysis[MbdKey.n_sym_ops_diff]
-    rmsd = df_model_analysis[MbdKey.structure_rmsd_vs_dft]
+    rmsd_vals = df_model_analysis[MbdKey.structure_rmsd_vs_dft]

     # Count total number of structures (excluding NaN values)
     n_structs = len(spg_diff.dropna())

-    # Calculate RMSD and MAE metrics
-    mean_rmsd = rmsd.mean()
+    # Fill NaN values with 1.0 (the stol value we set in StructureMatcher)
+    mean_rmsd = rmsd_vals.fillna(1.0).mean()
     sym_ops_mae = n_sym_ops_diff.abs().mean()

     # Count cases where spacegroup changed
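
Note on the NaN handling: the effect of `fillna(1.0)` on the mean can be seen with a toy series. The numbers are made up and assume that NaN marks structures for which no RMSD could be computed.

```python
import pandas as pd

# Made-up RMSD values: two matched structures and two with no RMSD (NaN).
rmsd_vals = pd.Series([0.1, 0.3, float("nan"), float("nan")])

# Series.mean() skips NaN by default, so only the matched structures count.
mean_skipping_nan = rmsd_vals.mean()  # about 0.2

# Filling with 1.0 makes every unmatched structure count as the worst case.
mean_with_fill = rmsd_vals.fillna(1.0).mean()  # about 0.6

print(f"{mean_skipping_nan=:.3f} {mean_with_fill=:.3f}")
```

The old behavior would have rewarded a model for failing to produce matchable structures; the new one penalizes each failure with the fill value.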

matbench_discovery/remote/fetch.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ def download_file(file_path: str, url: str) -> None:

         response.raise_for_status()

-        with open(file_path, "wb") as file:
+        with open(file_path, mode="wb") as file:
             file.write(response.content)
     except requests.exceptions.RequestException:
         print(f"Error downloading {url=}\nto {file_path=}.\n{traceback.format_exc()}")

matbench_discovery/remote/figshare.py

Lines changed: 2 additions & 2 deletions
@@ -22,9 +22,9 @@
 DOWNLOAD_URL_PREFIX: Final = "https://figshare.com/files"
 ARTICLE_IDS: Final[dict[str, int | None]] = {
     "model_preds_discovery": 28187990,
-    "model_preds_geo_opt": 28187999,
+    "model_preds_geo_opt": 28642406,
     "model_preds_phonons": 28347251,
-    "model_preds_diatomics": 28437344, # created 2024-02-13
+    "model_preds_diatomics": 28437344,
     "data_files": 22715158,
 }

matbench_discovery/structure/symmetry.py

Lines changed: 3 additions & 1 deletion
@@ -110,7 +110,9 @@ def pred_vs_ref_struct_symmetry(
         df_sym_pred[Key.n_sym_ops] - df_sym_ref[Key.n_sym_ops]
     )

-    structure_matcher = StructureMatcher()
+    # scale=False and stol=1 are important for getting accurate distance of atomic
+    # positions from DFT-relaxed positions. details in https://github.com/janosh/matbench-discovery/issues/230
+    structure_matcher = StructureMatcher(stol=1.0, scale=False)
     ref_ids, pred_ids = set(ref_structs), set(pred_structs)
     shared_ids = ref_ids & pred_ids
     if len(shared_ids) == 0:

models/alignn/test_alignn_discovery.py

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@
 }[task_type]
 input_col = {Task.IS2RE: Key.init_struct, Task.RS2RE: Key.final_struct}[task_type]

-df_in = pd.read_json(data_path).set_index(Key.mat_id)
+df_in = pd.read_json(data_path, lines=True).set_index(Key.mat_id)

 df_in[target_col] = df_wbm[target_col]
 if task_type == Task.RS2RE:

models/alignn/train_alignn.py

Lines changed: 4 additions & 4 deletions
@@ -51,17 +51,17 @@


 # %% Load data
-df_cse = pd.read_json(DataFiles.mp_computed_structure_entries.path).set_index(
+df_mp_cse = pd.read_json(DataFiles.mp_computed_structure_entries.path).set_index(
     Key.mat_id
 )
-df_cse[Key.structure] = [
+df_mp_cse[Key.structure] = [
     Structure.from_dict(cse[Key.structure])
-    for cse in tqdm(df_cse.entry, desc="Structures from dict")
+    for cse in tqdm(df_mp_cse.entry, desc="Structures from dict")
 ]

 # load energies
 df_in = pd.read_csv(DataFiles.mp_energies.path).set_index(Key.mat_id)
-df_in[Key.structure] = df_cse[Key.structure]
+df_in[Key.structure] = df_mp_cse[Key.structure]
 if target_col not in df_in:
     raise TypeError(f"{target_col!s} not in {df_in.columns=}")
