
Lower memory requirements on single GPU #321

Merged: 7 commits merged from feature/lower-memory-requirements into main on Mar 16, 2025

Conversation

@a-r-r-o-w (Member) commented on Mar 13, 2025

Fixes #315 (comment)

Tested only with CogView4. It should have a similar effect on all other models, since they share the same code paths.


On multi-GPU, these changes should result in no difference in behavior.


On single GPU, with no memory optimization flags (other than effectively mandatory defaults like --gradient_checkpointing):

| Phase | Value | Before | This PR |
|---|---|---|---|
| Memory before training start | memory_allocated | 11.973 | 11.973 |
| | memory_reserved | 11.98 | 11.98 |
| | max_memory_allocated | 11.973 | 11.973 |
| | max_memory_reserved | 11.98 | 11.98 |
| Memory before validation start | memory_allocated | 12.209 | 12.209 |
| | memory_reserved | 13.922 | 12.934 |
| | max_memory_allocated | 28.403 | 16.429 |
| | max_memory_reserved | 28.842 | 16.863 |
| Memory after validation end | memory_allocated | 29.359 | 29.358 |
| | memory_reserved | 29.383 | 35.117 |
| | max_memory_allocated | 32.706 | 32.705 |
| | max_memory_reserved | 35.111 | 35.117 |
| Memory before validation start | memory_allocated | 29.361 | 12.211 |
| | memory_reserved | 31.373 | 14.229 |
| | max_memory_allocated | 30.934 | 13.784 |
| | max_memory_reserved | 31.373 | 14.229 |
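
For reference, the values above appear to be the standard torch.cuda memory statistics (units presumably gigabytes; the tables do not state them). Below is a minimal sketch of how such numbers can be collected at each phase; the helper name is hypothetical and not part of the trainer's code.

```python
import torch

def log_cuda_memory(phase: str, device: int = 0) -> None:
    # Hypothetical helper: print the four torch.cuda memory statistics
    # (the same four rows reported per phase in the tables above), in GiB.
    gib = 1024 ** 3
    stats = {
        "memory_allocated": torch.cuda.memory_allocated(device),
        "memory_reserved": torch.cuda.memory_reserved(device),
        "max_memory_allocated": torch.cuda.max_memory_allocated(device),
        "max_memory_reserved": torch.cuda.max_memory_reserved(device),
    }
    print(phase, {name: round(value / gib, 3) for name, value in stats.items()})

# Example: call at a phase boundary. torch.cuda.reset_peak_memory_stats() can be
# called in between if the max_* values should be tracked per phase instead.
log_cuda_memory("Memory before validation start")
```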

The peak memory usage is not reduced in this case. That makes sense: during validation, all components are loaded onto the GPU. If offloading, such as enable_model_cpu_offload, is enabled, we can reduce the peak!


On single-GPU, using FP8 layerwise casting + model cpu offloading:

  --layerwise_upcasting_modules transformer
  --layerwise_upcasting_storage_dtype float8_e4m3fn
  --enable_model_cpu_offload
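
For context, these flags roughly correspond to the diffusers-level memory optimizations sketched below. This is an illustrative sketch only, not the trainer's actual code path; the model id and the standalone-pipeline usage are assumptions.

```python
import torch
from diffusers import CogView4Pipeline  # CogView4 is the model used for testing in this PR

# Model id assumed for illustration.
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

# Roughly what --layerwise_upcasting_modules transformer with
# --layerwise_upcasting_storage_dtype float8_e4m3fn does: store the transformer
# weights in FP8 and upcast them layer by layer to the compute dtype on the fly.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

# Roughly what --enable_model_cpu_offload does: keep components on the CPU and move
# each one to the GPU only while it is needed, which lowers the validation peak.
pipe.enable_model_cpu_offload()

image = pipe("an astronaut riding a horse", num_inference_steps=30).images[0]
```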

Fixes made:

  • On single GPU, move the transformer to the CPU when performing precomputation
  • Deallocate conditioning and latent modules after validation
  • If enable_model_cpu_offload is used, don't move everything to the GPU before the pipeline begins validation
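
A rough sketch of the deallocation pattern the second fix refers to (illustrative only, with a hypothetical helper; not the trainer's actual code):

```python
import gc
import torch

def release_components(components: dict[str, torch.nn.Module]) -> None:
    # Illustrative only: after validation, move no-longer-needed components
    # (e.g. the conditioning and latent modules) to the CPU, drop the references,
    # and return the freed blocks to the CUDA allocator so the peak does not
    # keep growing across validation runs.
    for name in list(components):
        components[name].to("cpu")
        del components[name]
    gc.collect()
    torch.cuda.empty_cache()
```

With these fixes and the flags above, memory looks as follows:
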
| Phase | Value | Before | This PR |
|---|---|---|---|
| Memory before training start | memory_allocated | 6.719 | 6.719 |
| | memory_reserved | 6.732 | 6.732 |
| | max_memory_allocated | 6.719 | 6.719 |
| | max_memory_reserved | 6.732 | 6.732 |
| Memory before validation start | memory_allocated | 6.956 | 6.956 |
| | memory_reserved | 9.719 | 9.562 |
| | max_memory_allocated | 23.149 | 16.429 |
| | max_memory_reserved | 23.576 | 16.863 |
| Memory after validation end | memory_allocated | 0.236 | 0.238 |
| | memory_reserved | 0.273 | 0.379 |
| | max_memory_allocated | 23.149 | 16.646 |
| | max_memory_reserved | 23.576 | 16.863 |
| Memory before validation start | memory_allocated | 24.105 | 6.956 |
| | memory_reserved | 26.006 | 9.65 |
| | max_memory_allocated | 25.701 | 9.078 |
| | max_memory_reserved | 26.006 | 9.65 |

cc @neph1 Please give it a spin when you get time!

@a-r-r-o-w changed the title from "Lower memory requirements" to "Lower memory requirements on single GPU" on Mar 13, 2025
@a-r-r-o-w (Member, Author) commented:

@neph1 LMK if you've had a chance to test this. I can proceed to merge if it looks good.

@neph1 (Contributor) commented on Mar 16, 2025

Sorry for the delay. I seem to have messed up my branch when making the json dataset branch; it seems I don't have the optimized ptd at the moment.
In any case, the comments I made in the v0.1.0 PR were after a run on that branch, so it's functional.
During training I didn't notice any difference in VRAM requirements, but I guess that was expected since I didn't run with validation; I haven't set that up properly so far.
I'll do that for my next dataset and report back, but it might take some days.
I say merge this and keep working on the other things for the next stable release.

@a-r-r-o-w merged commit 9a479b1 into main on Mar 16, 2025
1 check passed
@a-r-r-o-w deleted the feature/lower-memory-requirements branch on March 16, 2025 at 20:59
Development

Successfully merging this pull request may close these issues:

  • Can't launch with accelerate on single gpu
2 participants