
Replacing SubsetRandomSampler by RandomSampler in BATCH_SAMPLER #3261


Merged
merged 1 commit into from
Mar 7, 2025

Conversation

NohTow
Contributor

@NohTow NohTow commented Mar 7, 2025

Summary

This PR replaces SubsetRandomSampler with RandomSampler in the BATCH_SAMPLER implementation.
It drops memory usage by a large amount without losing performance or changing the current behavior.

Motivations

  • SubsetRandomSampler creates an explicit list of indices that is stored in memory. This list can become very large for large datasets, and even more so when using multiple datasets.
  • RandomSampler relies on lower-level PyTorch primitives, which are cheaper, and samples indices on the fly instead of keeping them in memory.
  • When used to sample from the whole dataset, RandomSampler (without replacement, the default behavior) is equivalent to SubsetRandomSampler.
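To illustrate the last point, here is a minimal sketch (using a plain Python list as a stand-in dataset, not the actual sentence-transformers code) showing that both samplers yield a permutation of all indices when covering the whole dataset:

```python
import torch
from torch.utils.data import RandomSampler, SubsetRandomSampler

data = list(range(10))  # stand-in for a real Dataset

# RandomSampler without replacement (the default) permutes 0..len(data)-1
random_order = list(RandomSampler(data))

# SubsetRandomSampler over *all* indices does the same thing, but first
# materializes the full index collection in memory
subset_order = list(SubsetRandomSampler(range(len(data))))

# Both are permutations of the same index set
assert sorted(random_order) == sorted(subset_order) == data
```

The exact orderings differ between runs, but each epoch both samplers visit every index exactly once, which is all the batch sampler relies on.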

Results & tests

  • I ran a PyLate training on the whole unsupervised Nomic dataset with 8 GPUs and 8 workers, and the RAM usage dropped from 1.6TB to 620GB (this is a fairly extreme setup, but it highlights how much we can gain). The results looked sane.
  • @tomaarsen ran a small benchmark, and RandomSampler is both faster and more memory-efficient when sampling a 1M-example dataset (80.0MB in 1.8-1.9s vs 136.0MB in 2.5-2.7s).
  • Tom and I also discussed this, and I am pretty sure the two samplers are equivalent here, so this is just a free lunch since we are not sampling from a strict subset.
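As a rough illustration of where the memory goes (a sketch, not the benchmark above): SubsetRandomSampler must keep an explicit index list alive for its whole lifetime, while RandomSampler only needs the dataset length and draws a permutation lazily:

```python
import sys
from torch.utils.data import RandomSampler, SubsetRandomSampler

n = 1_000_000

# SubsetRandomSampler keeps an explicit index list alive for its lifetime
indices = list(range(n))
subset_sampler = SubsetRandomSampler(indices)
# ~8MB for the list's pointer array alone, plus the int objects themselves
print(f"index list: {sys.getsizeof(indices) / 1e6:.1f} MB")

# RandomSampler only needs len(dataset); no per-index list is stored
random_sampler = RandomSampler(range(n))

# Both samplers cover the same number of samples per epoch
assert len(subset_sampler) == len(random_sampler) == n
```

With multiple datasets and multiple dataloader workers, each worker process holds its own copy of such lists, which is how the overhead multiplies in large multi-GPU setups.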

Changes

  • Simply replace SubsetRandomSampler with RandomSampler in the BATCH_SAMPLER implementation.
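A sketch of what the swap looks like in a generic DataLoader setup (illustrative only; the actual BATCH_SAMPLER code in sentence-transformers differs):

```python
from torch.utils.data import DataLoader, RandomSampler

dataset = list(range(100))  # stand-in for a real Dataset

# Before: sampler = SubsetRandomSampler(range(len(dataset)))
# After: the drop-in replacement when sampling the whole dataset
sampler = RandomSampler(dataset)

loader = DataLoader(dataset, batch_size=10, sampler=sampler)
assert len(list(loader)) == 10  # 100 items / batch_size 10
```

Since RandomSampler without replacement still visits every example exactly once per epoch, nothing downstream of the sampler needs to change.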

Related Issue

I did not open an issue about this, but I can if needed.

Shout-out to @arn4 for finding out why simply sampling IDs was so slow and costly in a resource-intensive setup!

Collaborator

@tomaarsen tomaarsen left a comment


I ran some tests with this, and RandomSampler is indeed faster and less memory-intensive! Thanks a bunch for this.

@tomaarsen tomaarsen merged commit a3466a0 into UKPLab:master Mar 7, 2025
9 checks passed
arn4 added a commit to arn4/pytorch that referenced this pull request Mar 13, 2025
Digging further into the issue at UKPLab/sentence-transformers#3261, the root of the problem is this iteration.
arn4 added a commit to arn4/pytorch that referenced this pull request Mar 13, 2025
… list

Digging further into the problem at UKPLab/sentence-transformers#3261, it boils down to this expensive loop over a torch tensor. Looping over a list, as in RandomSampler, solves the issue.
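The cost difference that commit refers to can be reproduced with a standalone sketch (an illustration, not the pytorch code itself): iterating a torch tensor element by element wraps every item in a 0-dim tensor, while iterating a list yields plain Python ints:

```python
import time
import torch

n = 1_000_000
perm = torch.randperm(n)

start = time.perf_counter()
for _ in perm:  # each step allocates a 0-dim tensor: slow
    pass
tensor_time = time.perf_counter() - start

perm_list = perm.tolist()
start = time.perf_counter()
for _ in perm_list:  # plain Python ints: much faster
    pass
list_time = time.perf_counter() - start

print(f"tensor loop: {tensor_time:.3f}s, list loop: {list_time:.3f}s")
assert list_time < tensor_time
```

This is why RandomSampler, which calls `.tolist()` on the permutation before iterating, avoids the slowdown.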
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Mar 31, 2025
… list (#149126)

Digging further into the problem at UKPLab/sentence-transformers#3261, it boils down to this expensive loop over a torch tensor. Looping over a list, as in RandomSampler, solves the issue.

Pull Request resolved: #149126
Approved by: https://github.com/divyanshk, https://github.com/cyyever
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
… list (pytorch#149126)