ModelCheckpoint saves multiple checkpoints when trainer is using DDP #18590

Open
@sebquetin

Bug description

When training a model with the DDP strategy and using ModelCheckpoint, the trainer saves multiple checkpoints instead of just one.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

import pytorch_lightning as pl  # assumed import; the report only shows the `pl` alias
from pytorch_lightning.callbacks import ModelCheckpoint

pl_model = ...
ckpt_path = ...
monitor = "val_loss"
ckpt_filename = "epoch{epoch}-val_loss{val_loss:.3f}"
latest_ckpt_filename = "latest-epoch{epoch}-val_loss{val_loss:.3f}"
save_on_train_epoch_end = False

# Keeps the 5 best checkpoints, ranked by validation loss.
checkpoint_callback = ModelCheckpoint(
    dirpath=ckpt_path,
    monitor=monitor,
    mode="min",
    filename=ckpt_filename,
    auto_insert_metric_name=False,
    save_top_k=5,
    save_weights_only=False,
    save_on_train_epoch_end=save_on_train_epoch_end,
)

# Keeps only the most recent checkpoint (highest epoch).
latest_checkpoint_callback = ModelCheckpoint(
    monitor="epoch",
    mode="max",
    dirpath=ckpt_path,
    filename=latest_ckpt_filename,
    auto_insert_metric_name=False,
    save_top_k=1,
    save_weights_only=False,
    save_on_train_epoch_end=save_on_train_epoch_end,
    every_n_epochs=1,
)

callbacks = [checkpoint_callback, latest_checkpoint_callback]
trainer = pl.Trainer(
    callbacks=callbacks,
    strategy="auto",
)
trainer.fit(pl_model)
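
To make the expected file names concrete, here is a rough sketch of how the two filename templates above would render for a given epoch and validation loss. It approximates the formatting with plain `str.format`; `render_name` is a hypothetical helper and the metric values are illustrative, not taken from the original run. Lightning's own naming may differ in details (for example, a version suffix when a file already exists).

```python
# Templates copied from the reproduction above.
ckpt_filename = "epoch{epoch}-val_loss{val_loss:.3f}"
latest_ckpt_filename = "latest-epoch{epoch}-val_loss{val_loss:.3f}"


def render_name(template: str, epoch: int, val_loss: float) -> str:
    """Approximate a checkpoint file name with str.format (hypothetical helper)."""
    return template.format(epoch=epoch, val_loss=val_loss) + ".ckpt"


# Illustrative metric values, not results from the reported run.
example = render_name(ckpt_filename, epoch=3, val_loss=0.12345)
latest_example = render_name(latest_ckpt_filename, epoch=3, val_loss=0.12345)
print(example)
print(latest_example)
```

Under this approximation, files from the first callback start with `epoch` and files from the second start with `latest-`, which makes it possible to tell which callback wrote each file.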

Error messages and logs

There is no error message
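
Since nothing is raised, the symptom is only visible in the checkpoint directory. A minimal sketch for counting the files each callback left behind, assuming the `latest-` prefix from the reproduction distinguishes the two callbacks; `count_checkpoints` is a hypothetical helper, not part of Lightning.

```python
from collections import Counter
from pathlib import Path


def count_checkpoints(ckpt_dir: str) -> Counter:
    """Count *.ckpt files per callback, keyed on the file name prefix."""
    counts = Counter()
    for path in Path(ckpt_dir).glob("*.ckpt"):
        # Files from latest_checkpoint_callback start with "latest-";
        # everything else is attributed to the top-k callback.
        key = "latest" if path.name.startswith("latest-") else "top_k"
        counts[key] += 1
    return counts
```

Calling `count_checkpoints(ckpt_path)` after training gives the two totals. With `save_top_k=5` and `save_top_k=1` as configured above, one would expect at most 5 `top_k` files and 1 `latest` file; larger counts would match the behavior reported here.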

Environment

Current environment
#- Lightning Component: Trainer and ModelCheckpoint
#- PyTorch Lightning Version: 2.0.9
#- PyTorch Version: 2.0.0
#- Python version: 3.10
#- CUDA/cuDNN version: NCCL version 2.16.2+cuda11.8
#- GPU models and configuration: AWS ml.p3.8xlarge, 4 NVIDIA Tesla V100 GPUs, NVLink
#- How you installed Lightning: pip


cc @carmocca @awaelchli @justusschock
