[DRAFT] Fix issue #220 #480
Description
Draft implementation for issue #220 (initialization of parallelization on the LSF scheduler).
The main idea is to have the Python code read a set of possible environment variables (e.g., from a JSON configuration) that define the rank and size of each process. Since each HPC job scheduler exposes different environment variables (SLURM exposes `SLURM_PROCID`/`SLURM_NTASKS`, while LSF jobs launched with `mpirun` expose `PMI_RANK` and `PMI_SIZE`), this approach avoids hardcoding scheduler-specific names. Instead, the code iterates through a list of known variable names.
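As a minimal sketch of this lookup (the SLURM and PMI variable names come from the description above; the function name, the fallback, and the shape of the candidate list are illustrative assumptions, not the actual implementation):

```python
import os

# Candidate (rank, size) environment-variable pairs; in the actual code these
# could be loaded from the JSON configuration mentioned above.
RANK_SIZE_VARS = [
    ("SLURM_PROCID", "SLURM_NTASKS"),  # SLURM
    ("PMI_RANK", "PMI_SIZE"),          # LSF launched via mpirun
]

def detect_rank_and_size() -> tuple[int, int]:
    """Return (rank, world_size) from the first matching pair of variables."""
    for rank_var, size_var in RANK_SIZE_VARS:
        if rank_var in os.environ and size_var in os.environ:
            return int(os.environ[rank_var]), int(os.environ[size_var])
    # No known scheduler variables are set: assume a single-process run.
    return 0, 1

rank, world_size = detect_rank_and_size()
```

Supporting a new scheduler then only means appending its variable pair to the list (or to the JSON configuration), without touching the detection logic.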
Since issue #437 (DeepSpeed support) has been opened, its resolution will overlap with the current issue. In particular, `init_torch()` and `init_ddp()` are not needed under DeepSpeed, as they are managed under the hood by the library.
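For context, DeepSpeed exposes this setup through its public `deepspeed.init_distributed()` call, which performs the process-group initialization itself; how that call would be wired into this codebase is an assumption, not part of this PR:

```python
# Sketch only: with DeepSpeed, the explicit init_torch()/init_ddp() steps are
# skipped, since the library initializes the process group from the launcher
# environment on its own.
import deepspeed

deepspeed.init_distributed(dist_backend="nccl")
```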
Type of Change
Issue Number
Issue #220
Code Compatibility
Code Performance and Testing
- I have run `uv run train` and (if necessary) `uv run evaluate` on at least one GPU node and it works
- Any required changes in the `$WEATHER_GENERATOR_PRIVATE` directory have been made

Dependencies
Documentation
Additional Notes