When running with default settings I get a warning:
/scratch/users/sfarr/miniconda3/envs/tmd/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:448: The combination of `DataLoader(`pin_memory=True`, `persistent_workers=True`) and `Trainer(reload_dataloaders_every_n_epochs > 0)` can lead to instability due to limitations in PyTorch (https://github.com/pytorch/pytorch/issues/91252). We recommend setting `pin_memory=False` in this case.
It will run the training for a long time but eventually crash with something like (using 4 GPUs):
File "/home/steve/miniconda3/envs/tmd/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1289, in _get_data
raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly
[rank: 1] Child process with PID 224779 terminated with code 1. Forcefully terminating all other processes to avoid zombies :zombie:
I can make it stable by setting num_workers to zero together with #322.
However, with num_workers=0 the throughput on 4 GPUs is slightly lower (4.8 it/s vs. 5.0 it/s) than with num_workers=4, which eventually crashes.
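For reference, a minimal sketch of the single-process workaround (the toy dataset and sizes here are illustrative, not from the actual training setup): with num_workers=0 the DataLoader loads batches in the main process, so there are no worker processes or pin-memory thread that could die mid-training.

```python
from torch.utils.data import DataLoader

# Toy map-style dataset stands in for the real one.
loader = DataLoader(range(8), batch_size=2, num_workers=0, pin_memory=False)

# All loading happens in the main process; no background threads involved.
batches = list(loader)  # 4 batches of 2 items each
```

The trade-off is exactly the one observed above: without worker processes, data loading can no longer overlap with GPU compute, which plausibly accounts for the small drop in it/s.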
It should be possible to set these options in the config YAML. I do not know the performance impact of either.
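Something along these lines is what I have in mind for the config, purely as a sketch; the key names are hypothetical and would need to match whatever the actual config schema uses:

```yaml
# Hypothetical config fragment -- key names are illustrative only.
dataloader:
  num_workers: 0      # avoid the worker/pin-memory crash
  pin_memory: false   # as recommended by the Lightning warning
```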