Describe the bug
When I train a Reverse Distillation (RD) model with anomalib 2.0.0 on a machine with multiple GPUs, I get this warning right before the first epoch starts:
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
Then after the first epoch completes, before the second epoch starts, I get this second warning:
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:434: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
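For reference, both messages point at the same logging call; the pattern they ask for looks like this (a minimal generic Lightning module I wrote to illustrate, not anomalib code):

```python
import torch
from torch import nn
import lightning.pytorch as pl


class DemoModule(pl.LightningModule):
    """Minimal module illustrating the logging pattern the warnings ask for."""

    def __init__(self) -> None:
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        # batch_size=... avoids Lightning's ambiguous-collection inference;
        # sync_dist=True accumulates the epoch-level metric across devices.
        self.log("train_loss", loss, on_epoch=True,
                 batch_size=x.shape[0], sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```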
Is this a bug or something I should address on my end?
Dataset
Folder
Model
Reverse Distillation
Steps to reproduce the behavior
Train any model (I used Reverse Distillation) with anomalib 2.0.0 and a recent Lightning (2.5.1) on a machine with multiple GPUs, e.g. with the script below.
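Minimal sketch of what I run (the dataset name/paths are placeholders for my custom Folder dataset, and I'm assuming Engine forwards the accelerator/devices kwargs to the Lightning Trainer):

```python
from anomalib.data import Folder
from anomalib.engine import Engine
from anomalib.models import ReverseDistillation

# Placeholder name/paths for my custom dataset; any Folder layout with
# normal/abnormal images should reproduce the warnings on a multi-GPU machine.
datamodule = Folder(
    name="custom",
    root="./datasets/custom",
    normal_dir="normal",
    abnormal_dir="abnormal",
    train_batch_size=16,
)

model = ReverseDistillation()

# Assumption on my part: Engine passes these kwargs through to the Trainer.
engine = Engine(accelerator="gpu", devices=4)
engine.fit(model=model, datamodule=datamodule)
```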
OS information
- OS: Ubuntu 22.04
- Python version: 3.10
- Anomalib version: 2.0.0
- PyTorch version: 2.6.0
- CUDA/cuDNN version: 12.6
- GPU models and configuration: 4x NVIDIA RTX A6000
- Any other relevant information: I'm using a custom dataset
Expected behavior
I would expect anomalib to pass the correct batch size to Lightning when logging, as sketched below.
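Something along these lines inside the model's training_step, purely hypothetical on my part (the `_compute_loss` helper and the `batch.image` field are my guesses, not actual anomalib source):

```python
def training_step(self, batch, batch_idx):
    loss = self._compute_loss(batch)  # hypothetical helper for the per-model loss
    self.log(
        "train_loss",
        loss,
        on_epoch=True,
        prog_bar=True,
        batch_size=batch.image.shape[0],  # derive the size from the image tensor
        sync_dist=True,                   # accumulate epoch metrics across devices
    )
    return loss
```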
Screenshots
Pip/GitHub
pip
What version/branch did you use?
No response
Configuration YAML
none
Logs
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
INFO:lightning_fabric.utilities.rank_zero:----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
INFO:anomalib.data.datamodules.base.image:No normal test images found. Sampling from training set using ratio of 0.20
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
WARNING:anomalib.metrics.evaluator:Number of devices is greater than 1, setting compute_on_cpu to False.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
| Name | Type | Params | Mode
--------------------------------------------------------------------
0 | pre_processor | PreProcessor | 0 | train
1 | post_processor | PostProcessor | 0 | train
2 | evaluator | Evaluator | 0 | train
3 | model | ReverseDistillationModel | 89.0 M | train
4 | loss | ReverseDistillationLoss | 0 | train
--------------------------------------------------------------------
89.0 M Trainable params
0 Non-trainable params
89.0 M Total params
356.009 Total estimated model params size (MB)
347 Modules in train mode
0 Modules in eval mode
Epoch 0: 0%| | 0/93 [00:00<?, ?it/s]/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/core/module.py:512: You called `self.log('train_loss', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████| 93/93 [12:22<00:00, 0.13it/s, train_loss_step=0.147]
/data/scratch/anomalib-2/python/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:434: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
Epoch 0: 100%|█████████████████████████████████████████████████████████| 93/93 [12:22<00:00, 0.13it/s, train_loss_step=0.147, train_loss_epoch=0.474]
INFO:lightning_fabric.utilities.rank_zero:Epoch 0, global step 93: 'train_loss' reached 0.47396 (best 0.47396), saving model to '/data/scratch/anomalib-2/results/ReverseDistillation/anomalib/latest/checkpoints/epoch=0-step=93.ckpt' as top 1
Epoch 1: 19%|███████████ | 18/93 [02:17<09:31, 0.13it/s, train_loss_step=0.131, train_loss_epoch=0.474]
Code of Conduct
- I agree to follow this project's Code of Conduct