Is your feature request related to a problem? Please describe.

The current MLflow logger implementation means that only 'rank 0' tasks have access to the run_id. This is imposed by the PTL MLFlowLogger implementation (see https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/loggers/mlflow.html#MLFlowLogger).

For now we have decided not to modify this, and instead to adapt our code to work with it; see 'def run_id' and 'def update_paths' in train.py. In practice this translates into different behaviour depending on the type of run we submit.

This could lead to rank-0 tasks diverging in behaviour from the rest of the tasks, and could be prone to bugs that are difficult to debug. It would be worth reviewing this design and considering whether there is a way to distribute/broadcast the run_id to all the other tasks.
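For context, here is a very rough, simplified sketch of the rank-zero-only pattern that produces this behaviour (not the actual MLFlowLogger source; the class and attribute names are illustrative): the run is only created where the experiment property is resolved, so every rank other than 0 ends up with run_id equal to None.

# Simplified, illustrative sketch of the rank-zero-only pattern behind
# MLFlowLogger.run_id (see the linked lightning source for the real code).
from lightning.pytorch.utilities import rank_zero_only


class SketchMLFlowLogger:
    """Hypothetical stand-in showing why only rank 0 ends up with a run_id."""

    def __init__(self) -> None:
        self._run_id = None

    @property
    def experiment(self):
        # The real logger creates the MLflow run inside a rank-zero-only
        # experiment property, so non-zero ranks never populate _run_id.
        if rank_zero_only.rank == 0 and self._run_id is None:
            self._run_id = "run-created-on-rank-0"  # placeholder for MLflow's create_run()
        return self._run_id

    @property
    def run_id(self):
        _ = self.experiment  # triggers run creation, but only on rank 0
        return self._run_id  # None on every other rank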
Potential solution/idea: use the strategy / torch.distributed to broadcast the run_id from rank 0 to all other ranks, and then update the paths (see the sketch below the screenshot). This works in the multi-node and multi-GPU cases, as well as when we start an srun interactive session and then execute 'aifs-train' or other commands (tested using srun -c 64 --mem=64G --partition=gpu --gpus-per-node=1 --ntasks-per-node=1 -t 02:00:00 --pty bash), or simply when we use a single GPU.
(Screenshot attached in the original comment: 2024-07-23 at 11:30:35)
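A minimal sketch of the broadcast idea, assuming the strategy has already initialised the torch.distributed process group (the helper name and the update_paths call are illustrative, not the actual train.py code):

# Sketch: broadcast the MLflow run_id from rank 0 to all other ranks,
# then let every rank update its paths consistently.
from typing import Optional

import torch.distributed as dist


def broadcast_run_id(run_id: Optional[str]) -> Optional[str]:
    """Illustrative helper: rank 0 passes its run_id, the other ranks pass None."""
    payload = [run_id]
    # broadcast_object_list pickles arbitrary Python objects (here a string)
    # and sends them from the src rank to every rank in the process group.
    dist.broadcast_object_list(payload, src=0)
    return payload[0]


# e.g. somewhere after the strategy has called dist.init_process_group(...):
# run_id = broadcast_run_id(logger.run_id if dist.get_rank() == 0 else None)
# update_paths(run_id)  # hypothetical call standing in for 'def update_paths' in train.py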
Using the launcher to do the broadcasting and to set up the strategy environment ensures that the code works both in sbatch and in interactive sessions, and also follows the way it is done in PyTorch Lightning, which submits the training/validation and testing runs via:
call._call_and_handle_interrupt(
    self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
)
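For completeness, the same idea can also be expressed through the strategy's own collective, e.g. from a setup hook that runs after the launcher has set up the distributed environment. This is a hedged sketch, not the actual aifs code: the callback name and the final assignment are assumptions, and the follow-up call to update_paths is hypothetical. Strategy.broadcast falls back to returning the object unchanged on a single device, so it also covers the 1-GPU case.

# Sketch: share the logger's run_id with every rank via the strategy's
# broadcast collective once the distributed environment is available.
from lightning.pytorch import Callback, Trainer


class BroadcastRunId(Callback):
    """Illustrative callback: distribute the rank-0 run_id to all ranks."""

    def setup(self, trainer: Trainer, pl_module, stage: str) -> None:
        # On rank 0 the MLflow logger already holds a run_id; the other ranks see None.
        run_id = getattr(trainer.logger, "run_id", None)
        run_id = trainer.strategy.broadcast(run_id, src=0)
        # Hypothetical follow-up: every rank can now update its paths consistently,
        # e.g. by calling something like update_paths(run_id) from train.py.
        pl_module.run_id = run_id

With something along these lines the broadcast happens once per run, regardless of whether the job was launched via sbatch or an interactive srun session.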