[Bug]: Warnings when training on multiple GPUs with 2.0.0 #2631

Open
@haimat

Description

Describe the bug

When I train a Reverse Distillation (RD) model with anomalib 2.0.0 on a machine with multiple GPUs, I get this warning right before the first epoch starts:

/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

Then after the first epoch, before the second one starts, I get this second warning:

/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:434: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.

Is this a bug or something I should address on my end?
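For context, both messages come from Lightning's `self.log` call rather than from the training itself. A minimal sketch of the logging call Lightning is asking for, with hypothetical module and attribute names (this is not anomalib's actual code):

```python
from lightning.pytorch import LightningModule


class MyAnomalyModule(LightningModule):
    """Illustrative only; anomalib's real Lightning modules differ."""

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # placeholder for the real RD loss

        # Passing batch_size explicitly avoids the "ambiguous collection"
        # inference, and sync_dist=True accumulates the epoch-level metric
        # across all DDP ranks instead of logging rank-local values.
        self.log(
            "train_loss",
            loss,
            batch_size=batch.image.shape[0],  # assumes the batch carries an `image` tensor
            sync_dist=True,
        )
        return loss
```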

Dataset

Folder

Model

Reverse Distillation

Steps to reproduce the behavior

Train a model with anomalib 2.0.0 and a recent lightning (2.5.1) on a machine with multiple GPUs; a minimal sketch is below.
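
A minimal multi-GPU run that should trigger the messages might look like this (paths, directory names, and batch size are placeholders; the Engine keyword arguments are assumed to be forwarded to the Lightning Trainer):

```python
from anomalib.data import Folder
from anomalib.engine import Engine
from anomalib.models import ReverseDistillation

# Placeholder dataset layout: any Folder-style dataset with normal/abnormal images.
datamodule = Folder(
    name="my_dataset",
    root="/path/to/dataset",
    normal_dir="good",
    abnormal_dir="defect",
    train_batch_size=16,
)

model = ReverseDistillation()

# Requesting several GPUs starts a DDP run, during which the warnings appear.
engine = Engine(accelerator="gpu", devices=4)
engine.fit(model=model, datamodule=datamodule)
```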

OS information

OS information:

  • OS: Ubuntu 22.04
  • Python version: 3.10
  • Anomalib version: 2.0.0
  • PyTorch version: 2.6.0
  • CUDA/cuDNN version: 12.6
  • GPU models and configuration: 4x NVIDIA RTX A6000
  • Any other relevant information: I'm using a custom dataset

Expected behavior

I would expect anomalib to pass the correct batch size to lightning when it logs metrics.

Screenshots

(screenshot of the training output attached in the original issue)

Pip/GitHub

pip

What version/branch did you use?

No response

Configuration YAML

none

Logs

Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
INFO:lightning_fabric.utilities.rank_zero:----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name           | Type                     | Params | Mode
--------------------------------------------------------------------
0 | pre_processor  | PreProcessor             | 0      | train
1 | post_processor | PostProcessor            | 0      | train
2 | evaluator      | Evaluator                | 0      | train
3 | model          | ReverseDistillationModel | 89.0 M | train
4 | loss           | ReverseDistillationLoss  | 0      | train
--------------------------------------------------------------------
89.0 M    Trainable params
0         Non-trainable params
89.0 M    Total params
356.009   Total estimated model params size (MB)
347       Modules in train mode
0         Modules in eval mode
Epoch 0:   0%|                                                                                                                 | 0/93 [00:00<?, ?it/s]/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/core/module.py:512: You called `self.log('train_loss', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████| 93/93 [12:22<00:00,  0.13it/s, train_loss_step=0.147]
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:434: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
Epoch 0: 100%|█████████████████████████████████████████████████████████| 93/93 [12:22<00:00,  0.13it/s, train_loss_step=0.147, train_loss_epoch=0.474]INFO:lightning_fabric.utilities.rank_zero:Epoch 0, global step 93: 'train_loss' reached 0.47396 (best 0.47396), saving model to '/data/scratch/anomalib-2/results/ReverseDistillation/anomalib/latest/checkpoints/epoch=0-step=93.ckpt' as top 1
Epoch 1:  19%|███████████                                              | 18/93 [02:17<09:31,  0.13it/s, train_loss_step=0.131, train_loss_epoch=0.474]

Code of Conduct

  • I agree to follow this project's Code of Conduct
