
[Improvement] Mlflow - run_id just known by rank0 #240


Open
anaprietonem opened this issue Apr 9, 2025 · 1 comment

Comments

@anaprietonem
Collaborator

Is your feature request related to a problem? Please describe.
The current MLflow logger implementation means that only 'rank0' tasks have access to the run_id. This is imposed by the PTL MLFlowLogger implementation (see https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/loggers/mlflow.html#MLFlowLogger):

    @property
    @rank_zero_experiment
    def experiment(self) -> MlflowClient:
        ...

For now we have decided not to modify this and instead adapt our code to make it work; see 'def run_id' and 'def update_paths' in train.py. This translates into the following behaviour depending on the type of run we submit.
[Screenshot: 2024-04-04 at 09 09 24]

This could lead to rank0 tasks having diverging behaviour compared with the rest of the tasks and could be prone to bugs that are difficult to debug. It could be worth reviewing this design and thinking about whether there is a way to distribute/broadcast the run_id to all other tasks.
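
For illustration, a minimal sketch of the behaviour described above (the experiment name and tracking URI are placeholders, not values from this repo): because MLFlowLogger.experiment is wrapped in @rank_zero_experiment, the MLflow run is only created on global rank 0, so run_id resolves to a real id there and to None on every other rank.

    from lightning.pytorch.loggers import MLFlowLogger
    from lightning.pytorch.utilities import rank_zero_only

    # Placeholder logger configuration, just to show the rank-dependent run_id.
    logger = MLFlowLogger(experiment_name="debug-experiment", tracking_uri="file:./mlruns")

    # The experiment (and therefore the MLflow run) is only created on global rank 0,
    # so run_id is a real id on rank 0 and None everywhere else.
    print(f"global rank {rank_zero_only.rank}: run_id={logger.run_id}")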

@anaprietonem
Collaborator Author

Potential solution/idea - use the strategy/torch distributed to broadcast the run_id from rank 0 to all other ranks, and then update the paths. This solution works in multi-node and multi-GPU cases, as well as when we run an srun interactive session and then execute 'aifs-train' or other commands (tested using srun -c 64 --mem=64G --partition=gpu --gpus-per-node=1 --ntasks-per-node=1 -t 02:00:00 --pty bash), or simply when we use 1 GPU.

[Screenshot: 2024-07-23 at 11 30 35]
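
As a hedged sketch of that idea (assuming torch.distributed has already been initialised by the strategy/launcher; the function and variable names are illustrative, not the actual train.py code):

    from typing import Optional

    import torch.distributed as dist

    def broadcast_run_id(run_id: Optional[str]) -> Optional[str]:
        """Broadcast the MLflow run_id known by rank 0 to every other rank."""
        payload = [run_id]  # rank 0 holds the real id, the other ranks hold None
        dist.broadcast_object_list(payload, src=0)
        return payload[0]

    # After this call every rank can update its run_id-dependent paths identically, e.g.:
    # run_id = broadcast_run_id(mlflow_logger.run_id)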
Using the launcher to do the broadcasting and setting up the strategy environment ensures that the code works both in sbatch and interactive sessions, and also follows the way it is done in PyTorch Lightning, since to submit the training/validation and testing runs it uses:
    call._call_and_handle_interrupt(
        self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    )


See an example of the logs with 1 full node - https://mlflow.copernicus-climate.eu/#/experiments/35/runs/48aeab1d25554842b4157c790510c8b2/artifacts
And the logs from resuming that run using 2 nodes / 2 GPUs - https://mlflow.copernicus-climate.eu/#/experiments/35/runs/de2953a5977e4c6b8968b7a339bb11db/artifacts

As mentioned in the comment of the 'update_paths' function, the function would need to be called after the trainer object is defined, since we would need access to the launcher and cluster environment that get instantiated in the accelerator connector of the pl.Trainer.
https://lightning.ai/docs/pytorch/stable/extensions/strategy.html
https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/trainer/connectors/accelerator_connector.py

    trainer = pl.Trainer(....)
    self.update_paths()
    trainer.fit(..)
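
A hedged sketch of what such an 'update_paths'-style hook could look like once the Trainer exists (the body and the mlflow_logger/run_id attributes are illustrative, not the repo's actual code, and it assumes the distributed environment has already been set up by the launcher):

    def update_paths(self) -> None:
        # `self.trainer` is assumed to hold the pl.Trainer created just above.
        # The strategy's collective broadcast ships the rank-0 run_id to all ranks.
        run_id = self.trainer.strategy.broadcast(self.mlflow_logger.run_id, src=0)
        self.run_id = run_id  # hypothetical attribute, for illustration only
        # ...then recompute checkpoint/plot paths from run_id identically on every rank.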
