[DRAFT] Fix issue #220 #480
Description
Draft implementation for issue #220 (initialization of parallelization on the LSF scheduler).
The main idea is to have the Python code read a set of possible environment variables (e.g., from a JSON configuration) that define the rank and size of each process. Since each HPC job scheduler exposes different environment variables (SLURM exposes `SLURM_PROCID`/`SLURM_NTASKS`, while LSF jobs launched with `mpirun` expose `PMI_RANK` and `PMI_SIZE`), this approach avoids hardcoding scheduler-specific names. Instead, the code iterates through a list of known variable names.
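As a minimal sketch of this lookup (the SLURM and PMI variable names come from the description above; the function name, the fallback, and the shape of the candidate list are illustrative assumptions, not the actual implementation):

```python
import os

# Candidate (rank, size) environment-variable pairs; in the actual code these
# could be loaded from the JSON configuration mentioned above.
RANK_SIZE_VARS = [
    ("SLURM_PROCID", "SLURM_NTASKS"),  # SLURM
    ("PMI_RANK", "PMI_SIZE"),          # LSF launched via mpirun
]

def detect_rank_and_size() -> tuple[int, int]:
    """Return (rank, world_size) from the first matching pair of variables."""
    for rank_var, size_var in RANK_SIZE_VARS:
        if rank_var in os.environ and size_var in os.environ:
            return int(os.environ[rank_var]), int(os.environ[size_var])
    # No known scheduler variables are set: assume a single-process run.
    return 0, 1

rank, world_size = detect_rank_and_size()
```

Supporting a new scheduler then only means appending its variable pair to the list (or to the JSON configuration), without touching the detection logic.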
Since issue #437 (DeepSpeed support) has been opened, its resolution will overlap with the current issue. In particular, `init_torch()` and `init_ddp()` are not needed under DeepSpeed, as they are managed under the hood by the library.
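For context, DeepSpeed exposes this setup through its public `deepspeed.init_distributed()` call, which performs the process-group initialization itself; how that call would be wired into this codebase is an assumption, not part of this PR:

```python
# Sketch only: with DeepSpeed, the explicit init_torch()/init_ddp() steps are
# skipped, since the library initializes the process group from the launcher
# environment on its own.
import deepspeed

deepspeed.init_distributed(dist_backend="nccl")
```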
Type of Change
Issue Number
Issue #220
Code Compatibility
Code Performance and Testing
- I have run `uv run train` and (if necessary) `uv run evaluate` on at least one GPU node and it works
- Any required changes in the `$WEATHER_GENERATOR_PRIVATE` directory have been made

Dependencies
Documentation
Additional Notes