Bug description
When training a model with the DDP strategy and using ModelCheckpoint, the trainer does not save only one checkpoint as expected.
What version are you seeing the problem on?
v2.0
How to reproduce the bug
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

pl_model = ...
ckpt_path = ...

monitor = "val_loss"
ckpt_filename = "epoch{epoch}-val_loss{val_loss:.3f}"
latest_ckpt_filename = "latest-epoch{epoch}-val_loss{val_loss:.3f}"
save_on_train_epoch_end = False

# Keeps the 5 best checkpoints ranked by validation loss.
checkpoint_callback = ModelCheckpoint(
    dirpath=ckpt_path,
    monitor=monitor,
    mode="min",
    filename=ckpt_filename,
    auto_insert_metric_name=False,
    save_top_k=5,
    save_weights_only=False,
    save_on_train_epoch_end=save_on_train_epoch_end,
)

# Should keep only the single latest checkpoint (save_top_k=1).
latest_checkpoint_callback = ModelCheckpoint(
    monitor="epoch",
    mode="max",
    dirpath=ckpt_path,
    filename=latest_ckpt_filename,
    auto_insert_metric_name=False,
    save_top_k=1,
    save_weights_only=False,
    save_on_train_epoch_end=save_on_train_epoch_end,
    every_n_epochs=1,
)

callbacks = [checkpoint_callback, latest_checkpoint_callback]
trainer = pl.Trainer(
    callbacks=callbacks,
    strategy="auto",
)
trainer.fit(pl_model)
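To confirm how many checkpoint files actually end up on disk after `trainer.fit`, a small helper like the following can be run on the checkpoint directory (the helper name `list_checkpoints` is hypothetical; pass it the same directory as `ckpt_path` from the repro above):

```python
from pathlib import Path

def list_checkpoints(ckpt_dir):
    """Return the sorted names of all .ckpt files found in ckpt_dir."""
    return sorted(p.name for p in Path(ckpt_dir).glob("*.ckpt"))
```

With `save_top_k=5` on the best-checkpoint callback and `save_top_k=1` on the latest-checkpoint callback, at most six `.ckpt` files would be expected in the directory; more than that indicates the behavior described in this report.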
Error messages and logs
There is no error message.
Environment
- Lightning Component: Trainer and ModelCheckpoint
- PyTorch Lightning Version: 2.0.9
- PyTorch Version: 2.0.0
- Python version: 3.10
- CUDA/cuDNN version: NCCL version 2.16.2+cuda11.8
- GPU models and configuration: AWS ml.p3.8xlarge: 4 NVIDIA Tesla V100 GPUs, NVLink
- How you installed Lightning: pip
More info
No response