[Bug]: Warnings when training on multiple GPUs with 2.0.0
Describe the bug
When I train a Reverse Distillation (RD) model with anomalib 2.0.0 on a machine with multiple GPUs, I get this warning message right before the first epoch starts:
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
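As far as I understand, the warning just wants the batch size passed explicitly to self.log. A minimal Lightning sketch of that pattern (a toy module with a (tensor, tensor) batch, purely for illustration; this is not anomalib's actual code):

import torch
import lightning.pytorch as pl

class SketchModule(pl.LightningModule):
    # Toy module, only to illustrate the logging call the warning asks for.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        images, targets = batch  # assumes a (tensor, tensor) batch
        loss = torch.nn.functional.mse_loss(self.layer(images), targets)
        # Passing batch_size explicitly stops Lightning from trying to infer it
        # from an ambiguous batch object, which is what triggers the warning.
        self.log("train_loss", loss, batch_size=images.shape[0])
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())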
Then after the first epoch, before the second epoch starts, I get this warning:
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:434: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
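If I read this correctly, the epoch-level value should also be synced across ranks. Continuing the sketch above, combining both recommendations would look roughly like this (again only an illustration, not the actual anomalib code):

    def training_step(self, batch, batch_idx):
        images, targets = batch
        loss = torch.nn.functional.mse_loss(self.layer(images), targets)
        # sync_dist=True reduces the epoch-level value across all DDP ranks,
        # instead of reporting only rank 0's local average.
        self.log(
            "train_loss",
            loss,
            on_step=True,
            on_epoch=True,
            sync_dist=True,
            batch_size=images.shape[0],
        )
        return loss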
Is this a bug or something I should address on my end?
Dataset
Folder
Model
Reverse Distillation
Steps to reproduce the behavior
Just train a Reverse Distillation model with anomalib 2.0.0 and a recent Lightning (2.5.1) on a multi-GPU machine, for example as in the sketch below.
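For reference, a rough sketch of my training script (directory names and Folder arguments are placeholders for my custom dataset, and the Engine keyword arguments may differ slightly from what I actually pass):

from anomalib.data import Folder
from anomalib.engine import Engine
from anomalib.models import ReverseDistillation

# Placeholder dataset layout; substitute your own root/normal/abnormal dirs.
datamodule = Folder(
    name="custom",
    root="/path/to/dataset",
    normal_dir="good",
    abnormal_dir="defect",
    train_batch_size=3,
)

model = ReverseDistillation()

# Trainer arguments are forwarded to Lightning; 4 GPUs reproduce the warnings.
engine = Engine(accelerator="gpu", devices=4, max_epochs=10)
engine.fit(model=model, datamodule=datamodule)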
OS information
OS information:
OS: Ubuntu 22.04
Python version: 3.10
Anomalib version: 2.0.0
PyTorch version: 2.6.0
CUDA/cuDNN version: 12.6
GPU models and configuration: 4x NVIDIA RTX A6000
Any other relevant information: I'm using a custom dataset
Expected behavior
I would expect anomalib to pass the correct batch size to Lightning when logging.
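If this turns out to be something I need to handle on my side, the only stopgap I can think of is filtering the two messages (assuming they go through Python's warnings machinery, which the file:line prefix suggests). This only hides the output; it does not change how the metric is aggregated:

import warnings

# Blunt workaround: silence the two Lightning messages. This does NOT fix the
# underlying logging call, it only hides the warnings.
warnings.filterwarnings("ignore", message=r"Trying to infer the `batch_size`")
warnings.filterwarnings("ignore", message=r"It is recommended to use .*sync_dist=True")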
Screenshots
Pip/GitHub
pip
What version/branch did you use?
No response
Configuration YAML
none
Logs
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
INFO:lightning_fabric.utilities.rank_zero:----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
| Name | Type | Params | Mode
--------------------------------------------------------------------
0 | pre_processor | PreProcessor | 0 | train
1 | post_processor | PostProcessor | 0 | train
2 | evaluator | Evaluator | 0 | train
3 | model | ReverseDistillationModel | 89.0 M | train
4 | loss | ReverseDistillationLoss | 0 | train
--------------------------------------------------------------------
89.0 M Trainable params
0 Non-trainable params
89.0 M Total params
356.009 Total estimated model params size (MB)
347 Modules in train mode
0 Modules in eval mode
Epoch 0:   0%| | 0/93 [00:00<?, ?it/s]
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/core/module.py:512: You called `self.log('train_loss', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████| 93/93 [12:22<00:00, 0.13it/s, train_loss_step=0.147]
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:434: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
Epoch 0: 100%|█████████████████████████████████████████████████████████| 93/93 [12:22<00:00, 0.13it/s, train_loss_step=0.147, train_loss_epoch=0.474]
INFO:lightning_fabric.utilities.rank_zero:Epoch 0, global step 93: 'train_loss' reached 0.47396 (best 0.47396), saving model to '/data/scratch/anomalib-2/results/ReverseDistillation/anomalib/latest/checkpoints/epoch=0-step=93.ckpt' as top 1
Epoch 1: 19%|███████████ | 18/93 [02:17<09:31, 0.13it/s, train_loss_step=0.131, train_loss_epoch=0.474]
Code of Conduct
I agree to follow this project's Code of Conduct