persistent_workers=True and pin_memory=True is unstable #324


Closed
sef43 opened this issue May 16, 2024 · 2 comments · Fixed by #322

Comments

@sef43
Collaborator

sef43 commented May 16, 2024

When running with default settings I get a warning:

/scratch/users/sfarr/miniconda3/envs/tmd/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:448: The combination of `DataLoader(`pin_memory=True`, `persistent_workers=True`) and `Trainer(reload_dataloaders_every_n_epochs > 0)` can lead to instability due to limitations in PyTorch (https://github.com/pytorch/pytorch/issues/91252). We recommend setting `pin_memory=False` in this case.
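
For context, the flagged combination boils down to something like this (a minimal, self-contained sketch with a dummy dataset, not the actual torchmd-net training code; lightning 2.x import style assumed):

import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning as L

# Dummy dataset standing in for the real one
dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 1))

# pin_memory=True together with persistent_workers=True is the pairing
# Lightning warns about...
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,
)

# ...but only in combination with reloading dataloaders between epochs
trainer = L.Trainer(reload_dataloaders_every_n_epochs=1)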

Training runs for a long time but eventually crashes with something like this (using 4 GPUs):

File "/home/steve/miniconda3/envs/tmd/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1289, in _get_data
    raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly
[rank: 1] Child process with PID 224779 terminated with code 1. Forcefully terminating all other processes to avoid zombies :zombie:

I can make it stable by setting num_workers to zero together with #322, but performance on 4 GPUs is then slightly lower (4.8 it/s vs 5.0 it/s) than with num_workers=4, which eventually crashes.

These options should be settable in the config YAML. I do not know the performance impact of either.
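
For illustration, the two workarounds look roughly like this when a DataLoader is built by hand (a sketch only; torchmd-net constructs its loaders internally, and the dataset here is a dummy stand-in):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 1))

# Option 1: no worker processes at all -- stable but slightly slower
stable_loader = DataLoader(dataset, batch_size=32, num_workers=0)

# Option 2: keep the workers but disable pinned memory, as the warning recommends
faster_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    persistent_workers=True,
    pin_memory=False,
)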

@sef43
Collaborator Author

sef43 commented May 16, 2024

Note that I am using the ACE dataset type; this might be specific to it, I am not sure.

@RaulPPelaez
Collaborator

pin memory is a little borked in lightning; it will crash if you have "enough" workers. We should probably just turn it off.
