recover num_ranks from previous run to calculate epoch_base #317

Jubeku · 2025-06-10T10:06:34Z

Description

When continue a training run, previously used num_ranks is used to calculate epoch_base.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Issue Number

This is supposed to fix #280 .

Code Compatibility

I have performed a self-review of my code

Code Performance and Testing

I ran the uv run train and (if necessary) uv run evaluate on a least one GPU node and it works
If the new feature introduces modifications at the config level, I have made sure to have notified the other software developers through Mattermost and updated the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

I have ensured that the code is still pip-installable after the changes and runs
I have tested that new dependencies themselves are pip-installable.
I have not introduced new dependencies in the inference portion of the pipeline

Documentation

My code follows the style guidelines of this project
I have updated the documentation and docstrings to reflect the changes
I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

FussyDuck · 2025-06-10T10:06:40Z

All committers have signed the CLA.

tjhunter

@Jubeku thanks! I put some suggestions. As @clessig mentioned in the ticket, we should switch to optim steps anyway later on.

tjhunter · 2025-06-16T10:17:28Z

src/weathergen/train/trainer.py

@@ -54,6 +54,10 @@ def init(

        self.devices = self.init_torch()

+        # Get num_ranks of previous, to be continued run before
+        # num_ranks gets overwritten by current setting during init_ddp()
+        self.num_ranks_original = cf.num_ranks if "num_ranks" in cf.keys() else None


it is so cheap to access we should not add it as an extra state in the class

The problem is that the next line will overwrite the "num_ranks" of the original run and adapt it to the current system. That's why it needs to be captured here.

ok. seem my comment below about style, but looks good otherwise.

tjhunter · 2025-06-16T10:20:18Z

src/weathergen/train/trainer.py

@@ -264,7 +268,15 @@ def run(self, cf, run_id_contd=None, epoch_contd=None):
        self.loss_fcts_val = [[getattr(losses, name), w] for name, w in cf.loss_fcts_val]

        # recover epoch when continuing run
-        epoch_base = int(self.cf.istep / len(self.data_loader))
+        if self.num_ranks_original is None:


I would calculate it here (and note the more pythonic way):

num_ranks_original = self.cf.get("num_ranks", None)

tjhunter · 2025-06-16T10:20:31Z

src/weathergen/train/trainer.py

+            epoch_base = int(self.cf.istep / len(self.data_loader))
+        else:
+            len_per_rank = (
+                (len(self.dataset) // self.num_ranks_original) // cf.batch_size


just for my sanity, can ew have (len(self.dataset) // (self.num_ranks_original* cf.batch_size))

clessig · 2025-06-16T14:26:55Z

@Jubeku thanks! I put some suggestions. As @clessig mentioned in the ticket, we should switch to optim steps anyway later on.

We should switch now, unless there is a good reason not to.

tjhunter · 2025-06-23T14:06:22Z

src/weathergen/train/trainer.py

@@ -54,6 +54,10 @@ def init(

        self.devices = self.init_torch()

+        # Get num_ranks of previous, to be continued run before
+        # num_ranks gets overwritten by current setting during init_ddp()
+        self.num_ranks_original = cf.num_ranks if "num_ranks" in cf.keys() else None


ok. seem my comment below about style, but looks good otherwise.

Jubeku · 2025-06-23T15:25:48Z

Thanks for the style suggestion @tjhunter. Changed accordingly.

* recover num_ranks from previous run to calculate epoch_base * set email settings for commits * addressing Tim's comment * make ruff happy * improve style

* - Avoid time encoding is 0 - eps in layer norms to 10^-3 - bf16 * Fixed incorrect cast * Make the attention dtype and norm eps configurable * Fix gitignore and add config files * Shuffle config files into sensible folders * Try fp16 * Fix some missing hardcoded * recover num_ranks from previous run to calculate epoch_base (#317) * recover num_ranks from previous run to calculate epoch_base * set email settings for commits * addressing Tim's comment * make ruff happy * improve style * changes (#385) Linter rule so np.ndarray is not used as type * changed the script name from evaluate to inference as it simply gener… (#376) * changed the script name from evaluate to inference as it simply generate infer samples * changed evaluate to inference in the main scripts and corresponding calls in the config * update the main function for the inference script * changed evaluate to inference also in docstring, unit test scripts, and integration test scripts --------- Co-authored-by: Patnala,Ankit <[email protected]> * Introduce tuples instead for strings to avoid TypeError (#392) * Exclude channels from src / target (#363) * Exclude channels from src / target * Simplified code and added comment that pattern matching is used * Adding new stream config * Fixing bug that led to error when accessing self.ds when dataset is empty * Wokign on exlcude_source * work in progress * Fixing incorrect formating for logger (#388) * Ruffed * Refactored and cleaned up channel selection. Also added check that channels are not empty * Cleaned channel parsing and selection * Adjustments * Removing asserts incompatible with empty dataset --------- Co-authored-by: Christian Lessig <[email protected]> * add embed_dropout_rate to config v1 (#358) * [402] adds checks to the pull request (#403) * chanegs * mistake * mistake * mistake * changes * doc * Introduce masking class and incorporate in TokenizerMasking (#383) * creating masking class and adapting tokenizer_masking to use this class * minor changes to masking.py and tokenizer_masking * removed old tokenizer_masking * include masking_strategy in default_config * change ValueError to assert * linting formatting changes files * further linting of docstrings * create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements * linted masking, tokenizer_masking * modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class * remove check if all masked, not masked * remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source * update tokenizer utils with description of idx_ord_lens in comment * remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked) * adding masking_strategy: to config * remove unused mentions of masking_combination * removed comment about streams * changed assert to check self perm_sel is not None * ruff masking, tokenizer_masking * Ruffed * Added warning to capture corner case, likely due to incorrect user settings. * Fixed incorrect call twice * Fixed missing conditional for logger statement * Required changes for better handling of rngs * Improved handling of rngs * Improved handling of rng --------- Co-authored-by: Christian Lessig <[email protected]> * Make the attention dtype and norm eps configurable * Final default config * Implement per-channel logging (#283) * Fix bug with seed being divided by 0 for worker ID=0 * Fix bug causing crash when secrets aren't in private config * Implement logging losses per channel * Fix issue with empty targets * Rework loss logging * ruff * Remove computing max_channels * Change variables names * ruffed * Remove redundant enumerations * Use stages for logging * Add type hints * Apply the review * ruff * fix * Fix type hints * ruff --------- Co-authored-by: Tim Hunter <[email protected]> * [346] Passing options through the slurm script (#400) * changes * fixes * - Avoid time encoding is 0 - eps in layer norms to 10^-3 - bf16 * Make the attention dtype and norm eps configurable * Fix gitignore and add config files * Clean up configs for PR * Clean up the forgotten HEAD * Apply ruff formatting * Organize imports * Add mlp norm eps to embed targets and pred adapter * Add comment --------- Co-authored-by: Christian Lessig <[email protected]> Co-authored-by: Julian Kuehnert <[email protected]> Co-authored-by: Timothy Hunter <[email protected]> Co-authored-by: ankitpatnala <[email protected]> Co-authored-by: Patnala,Ankit <[email protected]> Co-authored-by: Savvas Melidonis <[email protected]> Co-authored-by: Christian Lessig <[email protected]> Co-authored-by: Till Hauer <[email protected]> Co-authored-by: Seb Hickman <[email protected]> Co-authored-by: Kacper Nowak <[email protected]>

* Implement mock IO (#336) * Adapt score class score class (#339) * Implement mock IO * Adapt score class * Removing unused file (#349) * remove database folder (#355) * Small change - CI - pinning the version of formatting (#361) * changes * changes * Update INSTALL.md * Update INSTALL.md * Fixed Exxx lint issues (#284) * Rebased to the latest changes and linted new changes * addressed review comments * addressed review comments * Linted the latest changes. * corrected the formating * corrected the formating * configured ruff to use LF line endings in pyproject.toml * [357] Sub-package for evaluation (#359) * working * changes * removing deps from non-core project * changes * fixes * comments * Iluise quick fix stac (#374) * remove database folder * fix database * Simplifying workflow for plot_training (#368) * Simplifying workflow for plot_training * Ruffed * Working on implementing exclude_source * Remove unused code * Fixed ruff issue * Fixing bug in lat handling (377) (#378) * Fixing bug in lat handling * Added comment --------- Co-authored-by: Seb Hickman <[email protected]> * recover num_ranks from previous run to calculate epoch_base (#317) * recover num_ranks from previous run to calculate epoch_base * set email settings for commits * addressing Tim's comment * make ruff happy * improve style * changes (#385) Linter rule so np.ndarray is not used as type * changed the script name from evaluate to inference as it simply gener… (#376) * changed the script name from evaluate to inference as it simply generate infer samples * changed evaluate to inference in the main scripts and corresponding calls in the config * update the main function for the inference script * changed evaluate to inference also in docstring, unit test scripts, and integration test scripts --------- Co-authored-by: Patnala,Ankit <[email protected]> * Introduce tuples instead for strings to avoid TypeError (#392) * Exclude channels from src / target (#363) * Exclude channels from src / target * Simplified code and added comment that pattern matching is used * Adding new stream config * Fixing bug that led to error when accessing self.ds when dataset is empty * Wokign on exlcude_source * work in progress * Fixing incorrect formating for logger (#388) * Ruffed * Refactored and cleaned up channel selection. Also added check that channels are not empty * Cleaned channel parsing and selection * Adjustments * Removing asserts incompatible with empty dataset --------- Co-authored-by: Christian Lessig <[email protected]> * add embed_dropout_rate to config v1 (#358) * [402] adds checks to the pull request (#403) * chanegs * mistake * mistake * mistake * changes * doc * Introduce masking class and incorporate in TokenizerMasking (#383) * creating masking class and adapting tokenizer_masking to use this class * minor changes to masking.py and tokenizer_masking * removed old tokenizer_masking * include masking_strategy in default_config * change ValueError to assert * linting formatting changes files * further linting of docstrings * create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements * linted masking, tokenizer_masking * modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class * remove check if all masked, not masked * remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source * update tokenizer utils with description of idx_ord_lens in comment * remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked) * adding masking_strategy: to config * remove unused mentions of masking_combination * removed comment about streams * changed assert to check self perm_sel is not None * ruff masking, tokenizer_masking * Ruffed * Added warning to capture corner case, likely due to incorrect user settings. * Fixed incorrect call twice * Fixed missing conditional for logger statement * Required changes for better handling of rngs * Improved handling of rngs * Improved handling of rng --------- Co-authored-by: Christian Lessig <[email protected]> * Implement per-channel logging (#283) * Fix bug with seed being divided by 0 for worker ID=0 * Fix bug causing crash when secrets aren't in private config * Implement logging losses per channel * Fix issue with empty targets * Rework loss logging * ruff * Remove computing max_channels * Change variables names * ruffed * Remove redundant enumerations * Use stages for logging * Add type hints * Apply the review * ruff * fix * Fix type hints * ruff --------- Co-authored-by: Tim Hunter <[email protected]> * [346] Passing options through the slurm script (#400) * changes * fixes * refactor `validation_io.write_validation` to make it more readable * remove legacy code `validation_io.read_validation` * encapsulate artifact path logic in config module * remove redundant attribute `Trainer.path_run` * use config to look up base_path in `write_validation` * remove unused `write_validation` args: `base_path`, `rank` * ensure correct type for pathes * remove streams initialization from `Trainer` * remove path logic from `Trainer.save_model` * simplify conditional * rename mock io module * update uv to include dask * Implement io module to support reading/writing model output * implement new validation_io routine * use new write_validation routine * remove unused code * rename output routine to `write_output` * ruffed and added comments * fixed annotation * use simple __init__ method for `OutputItem` instead of dataclasses magic * address reviewers comments * rename method * add simple docstrings * ruffed * typehint fixes * refactor names * update comments and typehints, dont import pytorch * remove `__post_init__` methods, cache properties * fixes and integration test * final fixes :) * changes * changes * changes * changes * changes * more work * changes * changes * changes * ruffed * ruffed * improve logging and comments * Update to score-class according to internal discussions and feedback in PR. * Add license header. * Ruffed code. * Update to score-class according to internal discussions and feedback in PR. * Add license header. * Ruffed code. * Add doc-string to call-method and provide example usage for efficient graph-construction. * Some fixes to score-class. * Some fixes to handling aggregation dimension. * Add missing import of MockIO. * changes * changes * removing the scores * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes --------- Co-authored-by: Kacper Nowak <[email protected]> Co-authored-by: Christian Lessig <[email protected]> Co-authored-by: iluise <[email protected]> Co-authored-by: Sindhu-Vasireddy <[email protected]> Co-authored-by: Seb Hickman <[email protected]> Co-authored-by: Julian Kuehnert <[email protected]> Co-authored-by: ankitpatnala <[email protected]> Co-authored-by: Patnala,Ankit <[email protected]> Co-authored-by: Savvas Melidonis <[email protected]> Co-authored-by: Christian Lessig <[email protected]> Co-authored-by: Till Hauer <[email protected]> Co-authored-by: Simon Grasse <[email protected]> Co-authored-by: Michael <[email protected]>

* Implement mock IO (ecmwf#336) * Adapt score class score class (ecmwf#339) * Implement mock IO * Adapt score class * Removing unused file (ecmwf#349) * remove database folder (ecmwf#355) * Small change - CI - pinning the version of formatting (ecmwf#361) * changes * changes * Update INSTALL.md * Update INSTALL.md * Fixed Exxx lint issues (ecmwf#284) * Rebased to the latest changes and linted new changes * addressed review comments * addressed review comments * Linted the latest changes. * corrected the formating * corrected the formating * configured ruff to use LF line endings in pyproject.toml * [357] Sub-package for evaluation (ecmwf#359) * working * changes * removing deps from non-core project * changes * fixes * comments * Iluise quick fix stac (ecmwf#374) * remove database folder * fix database * Simplifying workflow for plot_training (ecmwf#368) * Simplifying workflow for plot_training * Ruffed * Working on implementing exclude_source * Remove unused code * Fixed ruff issue * Fixing bug in lat handling (377) (ecmwf#378) * Fixing bug in lat handling * Added comment --------- Co-authored-by: Seb Hickman <[email protected]> * recover num_ranks from previous run to calculate epoch_base (ecmwf#317) * recover num_ranks from previous run to calculate epoch_base * set email settings for commits * addressing Tim's comment * make ruff happy * improve style * changes (ecmwf#385) Linter rule so np.ndarray is not used as type * changed the script name from evaluate to inference as it simply gener… (ecmwf#376) * changed the script name from evaluate to inference as it simply generate infer samples * changed evaluate to inference in the main scripts and corresponding calls in the config * update the main function for the inference script * changed evaluate to inference also in docstring, unit test scripts, and integration test scripts --------- Co-authored-by: Patnala,Ankit <[email protected]> * Introduce tuples instead for strings to avoid TypeError (ecmwf#392) * Exclude channels from src / target (ecmwf#363) * Exclude channels from src / target * Simplified code and added comment that pattern matching is used * Adding new stream config * Fixing bug that led to error when accessing self.ds when dataset is empty * Wokign on exlcude_source * work in progress * Fixing incorrect formating for logger (ecmwf#388) * Ruffed * Refactored and cleaned up channel selection. Also added check that channels are not empty * Cleaned channel parsing and selection * Adjustments * Removing asserts incompatible with empty dataset --------- Co-authored-by: Christian Lessig <[email protected]> * add embed_dropout_rate to config v1 (ecmwf#358) * [402] adds checks to the pull request (ecmwf#403) * chanegs * mistake * mistake * mistake * changes * doc * Introduce masking class and incorporate in TokenizerMasking (ecmwf#383) * creating masking class and adapting tokenizer_masking to use this class * minor changes to masking.py and tokenizer_masking * removed old tokenizer_masking * include masking_strategy in default_config * change ValueError to assert * linting formatting changes files * further linting of docstrings * create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements * linted masking, tokenizer_masking * modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class * remove check if all masked, not masked * remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source * update tokenizer utils with description of idx_ord_lens in comment * remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked) * adding masking_strategy: to config * remove unused mentions of masking_combination * removed comment about streams * changed assert to check self perm_sel is not None * ruff masking, tokenizer_masking * Ruffed * Added warning to capture corner case, likely due to incorrect user settings. * Fixed incorrect call twice * Fixed missing conditional for logger statement * Required changes for better handling of rngs * Improved handling of rngs * Improved handling of rng --------- Co-authored-by: Christian Lessig <[email protected]> * Implement per-channel logging (ecmwf#283) * Fix bug with seed being divided by 0 for worker ID=0 * Fix bug causing crash when secrets aren't in private config * Implement logging losses per channel * Fix issue with empty targets * Rework loss logging * ruff * Remove computing max_channels * Change variables names * ruffed * Remove redundant enumerations * Use stages for logging * Add type hints * Apply the review * ruff * fix * Fix type hints * ruff --------- Co-authored-by: Tim Hunter <[email protected]> * [346] Passing options through the slurm script (ecmwf#400) * changes * fixes * refactor `validation_io.write_validation` to make it more readable * remove legacy code `validation_io.read_validation` * encapsulate artifact path logic in config module * remove redundant attribute `Trainer.path_run` * use config to look up base_path in `write_validation` * remove unused `write_validation` args: `base_path`, `rank` * ensure correct type for pathes * remove streams initialization from `Trainer` * remove path logic from `Trainer.save_model` * simplify conditional * rename mock io module * update uv to include dask * Implement io module to support reading/writing model output * implement new validation_io routine * use new write_validation routine * remove unused code * rename output routine to `write_output` * ruffed and added comments * fixed annotation * use simple __init__ method for `OutputItem` instead of dataclasses magic * address reviewers comments * rename method * add simple docstrings * ruffed * typehint fixes * refactor names * update comments and typehints, dont import pytorch * remove `__post_init__` methods, cache properties * fixes and integration test * final fixes :) * changes * changes * changes * changes * changes * more work * changes * changes * changes * ruffed * ruffed * improve logging and comments * Update to score-class according to internal discussions and feedback in PR. * Add license header. * Ruffed code. * Update to score-class according to internal discussions and feedback in PR. * Add license header. * Ruffed code. * Add doc-string to call-method and provide example usage for efficient graph-construction. * Some fixes to score-class. * Some fixes to handling aggregation dimension. * Add missing import of MockIO. * changes * changes * removing the scores * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes --------- Co-authored-by: Kacper Nowak <[email protected]> Co-authored-by: Christian Lessig <[email protected]> Co-authored-by: iluise <[email protected]> Co-authored-by: Sindhu-Vasireddy <[email protected]> Co-authored-by: Seb Hickman <[email protected]> Co-authored-by: Julian Kuehnert <[email protected]> Co-authored-by: ankitpatnala <[email protected]> Co-authored-by: Patnala,Ankit <[email protected]> Co-authored-by: Savvas Melidonis <[email protected]> Co-authored-by: Christian Lessig <[email protected]> Co-authored-by: Till Hauer <[email protected]> Co-authored-by: Simon Grasse <[email protected]> Co-authored-by: Michael <[email protected]>

recover num_ranks from previous run to calculate epoch_base

2445611

github-project-automation bot added this to WeatherGen-dev Jun 10, 2025

Jubeku mentioned this pull request Jun 10, 2025

epoch calculation incorrect when number of GPUs change in train_continue #280

Closed

Jubeku requested a review from clessig June 10, 2025 10:10

set email settings for commits

f4cb628

tjhunter requested changes Jun 16, 2025

View reviewed changes

github-project-automation bot moved this to In Progress in WeatherGen-dev Jun 16, 2025

Jubeku and others added 2 commits June 20, 2025 15:10

addressing Tim's comment

71ba474

make ruff happy

758324b

tjhunter approved these changes Jun 23, 2025

View reviewed changes

improve style

298990b

tjhunter merged commit fc4e874 into develop Jun 24, 2025
3 checks passed

github-project-automation bot moved this from In Progress to Done in WeatherGen-dev Jun 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

recover num_ranks from previous run to calculate epoch_base #317

recover num_ranks from previous run to calculate epoch_base #317

Uh oh!

Jubeku commented Jun 10, 2025 •

edited

Loading

Uh oh!

FussyDuck commented Jun 10, 2025 •

edited

Loading

Uh oh!

tjhunter left a comment

Uh oh!

tjhunter Jun 16, 2025

Uh oh!

Jubeku Jun 18, 2025

Uh oh!

tjhunter Jun 23, 2025

Uh oh!

tjhunter Jun 16, 2025

Uh oh!

tjhunter Jun 16, 2025

Uh oh!

clessig commented Jun 16, 2025

Uh oh!

tjhunter Jun 23, 2025

Uh oh!

Jubeku commented Jun 23, 2025

Uh oh!

Uh oh!

Uh oh!

recover num_ranks from previous run to calculate epoch_base #317

recover num_ranks from previous run to calculate epoch_base #317

Uh oh!

Conversation

Jubeku commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Issue Number

Code Compatibility

Code Performance and Testing

Dependencies

Documentation

Additional Notes

Uh oh!

FussyDuck commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tjhunter left a comment

Choose a reason for hiding this comment

Uh oh!

tjhunter Jun 16, 2025

Choose a reason for hiding this comment

Uh oh!

Jubeku Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

tjhunter Jun 23, 2025

Choose a reason for hiding this comment

Uh oh!

tjhunter Jun 16, 2025

Choose a reason for hiding this comment

Uh oh!

tjhunter Jun 16, 2025

Choose a reason for hiding this comment

Uh oh!

clessig commented Jun 16, 2025

Uh oh!

tjhunter Jun 23, 2025

Choose a reason for hiding this comment

Uh oh!

Jubeku commented Jun 23, 2025

Uh oh!

Uh oh!

Uh oh!

Jubeku commented Jun 10, 2025 •

edited

Loading

FussyDuck commented Jun 10, 2025 •

edited

Loading