[Bug]: Warnings when training on multiple GPUs with 2.0.0 #2631

Open

haimat opened this issue Mar 26, 2025 · 0 comments

haimat commented Mar 26, 2025

Describe the bug

When I train a Reverse Distillation (RD) model with anomalib 2.0.0 on a machine with multiple GPUs, I get this warning message right before the first epoch starts:

/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

Then after the first epoch, before the second one starts, I get this additional warning:

/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:434: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.

Is this a bug or something I should address on my end?
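
For reference, this is roughly what the two Lightning hints suggest when applied in a training step. It is only a minimal sketch with a generic LightningModule, not anomalib's actual lightning model; the module, dummy loss, and batch layout are placeholders:

```python
# Minimal sketch of the logging style the two Lightning messages ask for.
# NOT anomalib's code; the module, loss, and batch layout are placeholders.
import torch
from lightning.pytorch import LightningModule


class SketchModule(LightningModule):
    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        images = batch[0]            # assume the batch starts with an image tensor
        batch_size = images.shape[0]
        # Dummy loss just to have something to log.
        loss = self.layer(torch.randn(batch_size, 8, device=self.device)).mean()
        self.log(
            "train_loss",
            loss,
            on_epoch=True,
            batch_size=batch_size,   # explicit batch size -> avoids the data.py:79 warning
            sync_dist=True,          # reduce across GPUs  -> avoids the result.py:434 hint
        )
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```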

Dataset

Folder

Model

Reverse Distillation

Steps to reproduce the behavior

Just train a model with anomalib 2.0.0 and a recent Lightning (2.5.1) on a machine with multiple GPUs, for example as sketched below.
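
A rough reproduction sketch, assuming the documented anomalib 2.0 API; the dataset name, paths, and directory names are placeholders for my custom Folder dataset, and the Engine arguments are assumed to be forwarded to the Lightning Trainer:

```python
# Rough reproduction sketch (dataset name and paths are placeholders).
from anomalib.data import Folder
from anomalib.engine import Engine
from anomalib.models import ReverseDistillation

datamodule = Folder(
    name="my_dataset",          # placeholder name for the custom dataset
    root="/path/to/dataset",    # placeholder root directory
    normal_dir="good",          # placeholder: folder with normal images
    abnormal_dir="defect",      # placeholder: folder with anomalous images
    train_batch_size=3,         # matches the batch size reported in the warning
)

model = ReverseDistillation()

# Engine arguments are assumed to be passed through to the Lightning Trainer.
engine = Engine(accelerator="gpu", devices=4)
engine.fit(model=model, datamodule=datamodule)
```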

OS information

OS information:

  • OS: 22.04
  • Python version: 3.10
  • Anomalib version: 2.0.0
  • PyTorch version: 2.6.0
  • CUDA/cuDNN version: 12.6
  • GPU models and configuration: 4x NVIDIA RTX A6000
  • Any other relevant information: I'm using a custom dataset

Expected behavior

I would expect anomalib to pass the correct batch size to Lightning when logging, so that these warnings do not appear.

Screenshots

(Screenshot attached in the original issue.)

Pip/GitHub

pip

What version/branch did you use?

No response

Configuration YAML

none

Logs

Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
INFO:lightning_fabric.utilities.rank_zero:----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name           | Type                     | Params | Mode
--------------------------------------------------------------------
0 | pre_processor  | PreProcessor             | 0      | train
1 | post_processor | PostProcessor            | 0      | train
2 | evaluator      | Evaluator                | 0      | train
3 | model          | ReverseDistillationModel | 89.0 M | train
4 | loss           | ReverseDistillationLoss  | 0      | train
--------------------------------------------------------------------
89.0 M    Trainable params
0         Non-trainable params
89.0 M    Total params
356.009   Total estimated model params size (MB)
347       Modules in train mode
0         Modules in eval mode
Epoch 0:   0%|                                                                                                                 | 0/93 [00:00<?, ?it/s]/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/core/module.py:512: You called `self.log('train_loss', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████| 93/93 [12:22<00:00,  0.13it/s, train_loss_step=0.147]
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:434: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
Epoch 0: 100%|█████████████████████████████████████████████████████████| 93/93 [12:22<00:00,  0.13it/s, train_loss_step=0.147, train_loss_epoch=0.474]INFO:lightning_fabric.utilities.rank_zero:Epoch 0, global step 93: 'train_loss' reached 0.47396 (best 0.47396), saving model to '/data/scratch/anomalib-2/results/ReverseDistillation/anomalib/latest/checkpoints/epoch=0-step=93.ckpt' as top 1
Epoch 1:  19%|███████████                                              | 18/93 [02:17<09:31,  0.13it/s, train_loss_step=0.131, train_loss_epoch=0.474]

Code of Conduct

  • I agree to follow this project's Code of Conduct
@haimat haimat changed the title [Bug]: Trying to infer the batch_size from an ambiguous collection [Bug]: Multiple warnings when training on multiple GPUs with 2.0.0 Mar 26, 2025
@haimat haimat changed the title [Bug]: Multiple warnings when training on multiple GPUs with 2.0.0 [Bug]: Warnings when training on multiple GPUs with 2.0.0 Mar 26, 2025