
Replacing SubsetRandomSampler by RandomSampler in BATCH_SAMPLER #3261


Merged
merged 1 commit into from
Mar 7, 2025

Conversation

NohTow
Contributor

@NohTow NohTow commented Mar 7, 2025

Summary

This PR replaces SubsetRandomSampler with RandomSampler in the BATCH_SAMPLER implementation.
It drops memory usage by a large amount without losing performance or changing the current behavior.

Motivations

  • SubsetRandomSampler creates an explicit list of indices that is stored in memory. This list can become very large for large datasets, and even more so when using multiple datasets.
  • RandomSampler relies on lower-level PyTorch primitives, which are cheaper, and samples indices on the fly instead of keeping them in memory.
  • When used to sample from the whole dataset, RandomSampler (without replacement, the default behavior) is equivalent to SubsetRandomSampler.
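To illustrate the last point, here is a minimal sketch (using a plain Python list as a stand-in dataset, not the actual sentence-transformers code) showing that both samplers yield a permutation of all indices when covering the whole dataset:

```python
import torch
from torch.utils.data import RandomSampler, SubsetRandomSampler

data = list(range(10))  # stand-in for a real Dataset

# RandomSampler without replacement (the default) permutes 0..len(data)-1
random_order = list(RandomSampler(data))

# SubsetRandomSampler over *all* indices does the same thing, but first
# materializes the full index collection in memory
subset_order = list(SubsetRandomSampler(range(len(data))))

# Both are permutations of the same index set
assert sorted(random_order) == sorted(subset_order) == data
```

The exact orderings differ between runs, but each epoch both samplers visit every index exactly once, which is all the batch sampler relies on.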

Results & tests

  • I ran a PyLate training on the whole unsupervised Nomic dataset with 8 GPUs and 8 workers, and the RAM usage dropped from 1.6TB to 620GB (this is a fairly extreme setup, but it highlights how much we can gain). The results looked sane.
  • @tomaarsen ran a small benchmark, and RandomSampler is both faster and more memory-efficient when sampling a 1M-example dataset (80.0MB in 1.8-1.9s vs 136.0MB in 2.5-2.7s).
  • Tom and I also discussed this, and I am pretty sure the two samplers are equivalent here, so this is just a free lunch since we are not sampling from a strict subset.
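As a rough illustration of where the memory goes (a sketch, not the benchmark above): SubsetRandomSampler must keep an explicit index list alive for its whole lifetime, while RandomSampler only needs the dataset length and draws a permutation lazily:

```python
import sys
from torch.utils.data import RandomSampler, SubsetRandomSampler

n = 1_000_000

# SubsetRandomSampler keeps an explicit index list alive for its lifetime
indices = list(range(n))
subset_sampler = SubsetRandomSampler(indices)
# ~8MB for the list's pointer array alone, plus the int objects themselves
print(f"index list: {sys.getsizeof(indices) / 1e6:.1f} MB")

# RandomSampler only needs len(dataset); no per-index list is stored
random_sampler = RandomSampler(range(n))

# Both samplers cover the same number of samples per epoch
assert len(subset_sampler) == len(random_sampler) == n
```

With multiple datasets and multiple dataloader workers, each worker process holds its own copy of such lists, which is how the overhead multiplies in large multi-GPU setups.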

Changes

  • Simply replace SubsetRandomSampler with RandomSampler in the BATCH_SAMPLER implementation.
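A sketch of what the swap looks like in a generic DataLoader setup (illustrative only; the actual BATCH_SAMPLER code in sentence-transformers differs):

```python
from torch.utils.data import DataLoader, RandomSampler

dataset = list(range(100))  # stand-in for a real Dataset

# Before: sampler = SubsetRandomSampler(range(len(dataset)))
# After: the drop-in replacement when sampling the whole dataset
sampler = RandomSampler(dataset)

loader = DataLoader(dataset, batch_size=10, sampler=sampler)
assert len(list(loader)) == 10  # 100 items / batch_size 10
```

Since RandomSampler without replacement still visits every example exactly once per epoch, nothing downstream of the sampler needs to change.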

Related Issue

I did not open an issue about this, but I can if needed.

Shout-out to @arn4 for finding out why simply sampling IDs was so slow and costly in a resource-intensive setup!

Collaborator

@tomaarsen tomaarsen left a comment


I ran some tests with this, and RandomSampler is indeed faster and less memory-intensive! Thanks a bunch for this.

@tomaarsen tomaarsen merged commit a3466a0 into UKPLab:master Mar 7, 2025
9 checks passed
arn4 added a commit to arn4/pytorch that referenced this pull request Mar 13, 2025
Digging further into the issue at UKPLab/sentence-transformers#3261, the root of the problem is this iteration.
arn4 added a commit to arn4/pytorch that referenced this pull request Mar 13, 2025
… list

Digging further into the problem at UKPLab/sentence-transformers#3261, it boils down to this expensive loop over a torch tensor. Looping over a list, as in RandomSampler, solves the issue.
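The cost difference that commit refers to can be reproduced with a standalone sketch (an illustration, not the pytorch code itself): iterating a torch tensor element by element wraps every item in a 0-dim tensor, while iterating a list yields plain Python ints:

```python
import time
import torch

n = 1_000_000
perm = torch.randperm(n)

start = time.perf_counter()
for _ in perm:  # each step allocates a 0-dim tensor: slow
    pass
tensor_time = time.perf_counter() - start

perm_list = perm.tolist()
start = time.perf_counter()
for _ in perm_list:  # plain Python ints: much faster
    pass
list_time = time.perf_counter() - start

print(f"tensor loop: {tensor_time:.3f}s, list loop: {list_time:.3f}s")
assert list_time < tensor_time
```

This is why RandomSampler, which calls `.tolist()` on the permutation before iterating, avoids the slowdown.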
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Mar 31, 2025
… list (#149126)

Digging further into the problem at UKPLab/sentence-transformers#3261, it boils down to this expensive loop over a torch tensor. Looping over a list, as in RandomSampler, solves the issue.

Pull Request resolved: #149126
Approved by: https://github.com/divyanshk, https://github.com/cyyever
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
… list (pytorch#149126)