
[Improvement] Mlflow - run_id just known by rank0 #240


Open
anaprietonem opened this issue Apr 9, 2025 · 1 comment

Comments

@anaprietonem
Collaborator

Is your feature request related to a problem? Please describe.
The current MLflow logger implementation means that only 'rank0' tasks have access to the run_id. This is imposed by the PTL MLFlowLogger implementation (see https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/loggers/mlflow.html#MLFlowLogger):

    @property
    @rank_zero_experiment
    def experiment(self) -> MlflowClient:
        ...

For now we have decided not to modify this and instead adapt our code to make it work; see 'def run_id' and 'def update_paths' in train.py. This translates into the following behaviour depending on the type of run we submit.
[Screenshot: 2024-04-04 at 09 09 24]

This could lead to rank0 tasks having diverging behaviour compared with the rest of the tasks and could be prone to bugs that are difficult to debug. It could be worth reviewing this design and thinking about whether there is a way to distribute/broadcast the run_id to all other tasks.
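
For illustration, a minimal sketch of the behaviour described above (the experiment name and tracking URI are placeholders, not values from this repo): because MLFlowLogger.experiment is wrapped in @rank_zero_experiment, the MLflow run is only created on global rank 0, so run_id resolves to a real id there and to None on every other rank.

    from lightning.pytorch.loggers import MLFlowLogger
    from lightning.pytorch.utilities import rank_zero_only

    # Placeholder logger configuration, just to show the rank-dependent run_id.
    logger = MLFlowLogger(experiment_name="debug-experiment", tracking_uri="file:./mlruns")

    # The experiment (and therefore the MLflow run) is only created on global rank 0,
    # so run_id is a real id on rank 0 and None everywhere else.
    print(f"global rank {rank_zero_only.rank}: run_id={logger.run_id}")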

@anaprietonem
Collaborator Author

Potential solution/idea - use the strategy/torch distributed to broadcast the run_id from rank 0 to all other ranks, and then update the paths. This solution works in multi-node and multi-GPU cases, as well as when we run an srun interactive session and then execute 'aifs-train' or other commands (tested using srun -c 64 --mem=64G --partition=gpu --gpus-per-node=1 --ntasks-per-node=1 -t 02:00:00 --pty bash), or simply when we use 1 GPU.

[Screenshot: 2024-07-23 at 11 30 35]
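
As a hedged sketch of that idea (assuming torch.distributed has already been initialised by the strategy/launcher; the function and variable names are illustrative, not the actual train.py code):

    from typing import Optional

    import torch.distributed as dist

    def broadcast_run_id(run_id: Optional[str]) -> Optional[str]:
        """Broadcast the MLflow run_id known by rank 0 to every other rank."""
        payload = [run_id]  # rank 0 holds the real id, the other ranks hold None
        dist.broadcast_object_list(payload, src=0)
        return payload[0]

    # After this call every rank can update its run_id-dependent paths identically, e.g.:
    # run_id = broadcast_run_id(mlflow_logger.run_id)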
Using the launcher to do the broadcasting and setting up the strategy environment ensures that the code works both in sbatch and interactive sessions, and also follows the way it is done in PyTorch Lightning, since to submit the training/validation and testing runs it uses:
    call._call_and_handle_interrupt(
        self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    )


See an example of the logs with 1 full node - https://mlflow.copernicus-climate.eu/#/experiments/35/runs/48aeab1d25554842b4157c790510c8b2/artifacts
And the logs from resuming that run using 2 nodes / 2 GPUs - https://mlflow.copernicus-climate.eu/#/experiments/35/runs/de2953a5977e4c6b8968b7a339bb11db/artifacts

As mentioned in the comment of the 'update_paths' function, the function would need to be called after the trainer object is defined, since we would need access to the launcher and cluster environment that get instantiated in the accelerator connector of the pl.Trainer.
https://lightning.ai/docs/pytorch/stable/extensions/strategy.html
https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/trainer/connectors/accelerator_connector.py

    trainer = pl.Trainer(....)
    self.update_paths()
    trainer.fit(..)
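
A hedged sketch of what such an 'update_paths'-style hook could look like once the Trainer exists (the body and the mlflow_logger/run_id attributes are illustrative, not the repo's actual code, and it assumes the distributed environment has already been set up by the launcher):

    def update_paths(self) -> None:
        # `self.trainer` is assumed to hold the pl.Trainer created just above.
        # The strategy's collective broadcast ships the rank-0 run_id to all ranks.
        run_id = self.trainer.strategy.broadcast(self.mlflow_logger.run_id, src=0)
        self.run_id = run_id  # hypothetical attribute, for illustration only
        # ...then recompute checkpoint/plot paths from run_id identically on every rank.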
