
Lower memory requirements on single GPU #321

Merged: 7 commits merged from feature/lower-memory-requirements into main on Mar 16, 2025

Conversation

@a-r-r-o-w (Member) commented on Mar 13, 2025

Fixes #315 (comment)

Tested only with CogView4. It should have a similar effect on all other models, since they share the same code paths.


On multi-GPU, these changes should result in no difference in behavior.


On single GPU, with no memory optimization flags (other than effectively mandatory defaults like --gradient_checkpointing):

| Phase | Value | Before | This PR |
|---|---|---|---|
| Memory before training start | memory_allocated | 11.973 | 11.973 |
| | memory_reserved | 11.98 | 11.98 |
| | max_memory_allocated | 11.973 | 11.973 |
| | max_memory_reserved | 11.98 | 11.98 |
| Memory before validation start | memory_allocated | 12.209 | 12.209 |
| | memory_reserved | 13.922 | 12.934 |
| | max_memory_allocated | 28.403 | 16.429 |
| | max_memory_reserved | 28.842 | 16.863 |
| Memory after validation end | memory_allocated | 29.359 | 29.358 |
| | memory_reserved | 29.383 | 35.117 |
| | max_memory_allocated | 32.706 | 32.705 |
| | max_memory_reserved | 35.111 | 35.117 |
| Memory before validation start | memory_allocated | 29.361 | 12.211 |
| | memory_reserved | 31.373 | 14.229 |
| | max_memory_allocated | 30.934 | 13.784 |
| | max_memory_reserved | 31.373 | 14.229 |
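
For reference, the values above appear to be the standard torch.cuda memory statistics (units presumably gigabytes; the tables do not state them). Below is a minimal sketch of how such numbers can be collected at each phase; the helper name is hypothetical and not part of the trainer's code.

```python
import torch

def log_cuda_memory(phase: str, device: int = 0) -> None:
    # Hypothetical helper: print the four torch.cuda memory statistics
    # (the same four rows reported per phase in the tables above), in GiB.
    gib = 1024 ** 3
    stats = {
        "memory_allocated": torch.cuda.memory_allocated(device),
        "memory_reserved": torch.cuda.memory_reserved(device),
        "max_memory_allocated": torch.cuda.max_memory_allocated(device),
        "max_memory_reserved": torch.cuda.max_memory_reserved(device),
    }
    print(phase, {name: round(value / gib, 3) for name, value in stats.items()})

# Example: call at a phase boundary. torch.cuda.reset_peak_memory_stats() can be
# called in between if the max_* values should be tracked per phase instead.
log_cuda_memory("Memory before validation start")
```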

The peak memory usage is not reduced in this case. That makes sense: during validation, all components are loaded onto the GPU. If offloading, such as enable_model_cpu_offload, is enabled, we can reduce the peak!


On single-GPU, using FP8 layerwise casting + model cpu offloading:

  --layerwise_upcasting_modules transformer
  --layerwise_upcasting_storage_dtype float8_e4m3fn
  --enable_model_cpu_offload
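
For context, these flags roughly correspond to the diffusers-level memory optimizations sketched below. This is an illustrative sketch only, not the trainer's actual code path; the model id and the standalone-pipeline usage are assumptions.

```python
import torch
from diffusers import CogView4Pipeline  # CogView4 is the model used for testing in this PR

# Model id assumed for illustration.
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

# Roughly what --layerwise_upcasting_modules transformer with
# --layerwise_upcasting_storage_dtype float8_e4m3fn does: store the transformer
# weights in FP8 and upcast them layer by layer to the compute dtype on the fly.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

# Roughly what --enable_model_cpu_offload does: keep components on the CPU and move
# each one to the GPU only while it is needed, which lowers the validation peak.
pipe.enable_model_cpu_offload()

image = pipe("an astronaut riding a horse", num_inference_steps=30).images[0]
```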

Fixes made:

  • On single GPU, move the transformer to the CPU when performing precomputation
  • Deallocate conditioning and latent modules after validation
  • If enable_model_cpu_offload is used, don't move everything to the GPU before the pipeline begins validation
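
A rough sketch of the deallocation pattern the second fix refers to (illustrative only, with a hypothetical helper; not the trainer's actual code):

```python
import gc
import torch

def release_components(components: dict[str, torch.nn.Module]) -> None:
    # Illustrative only: after validation, move no-longer-needed components
    # (e.g. the conditioning and latent modules) to the CPU, drop the references,
    # and return the freed blocks to the CUDA allocator so the peak does not
    # keep growing across validation runs.
    for name in list(components):
        components[name].to("cpu")
        del components[name]
    gc.collect()
    torch.cuda.empty_cache()
```

With these fixes and the flags above, memory looks as follows:
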
| Phase | Value | Before | This PR |
|---|---|---|---|
| Memory before training start | memory_allocated | 6.719 | 6.719 |
| | memory_reserved | 6.732 | 6.732 |
| | max_memory_allocated | 6.719 | 6.719 |
| | max_memory_reserved | 6.732 | 6.732 |
| Memory before validation start | memory_allocated | 6.956 | 6.956 |
| | memory_reserved | 9.719 | 9.562 |
| | max_memory_allocated | 23.149 | 16.429 |
| | max_memory_reserved | 23.576 | 16.863 |
| Memory after validation end | memory_allocated | 0.236 | 0.238 |
| | memory_reserved | 0.273 | 0.379 |
| | max_memory_allocated | 23.149 | 16.646 |
| | max_memory_reserved | 23.576 | 16.863 |
| Memory before validation start | memory_allocated | 24.105 | 6.956 |
| | memory_reserved | 26.006 | 9.65 |
| | max_memory_allocated | 25.701 | 9.078 |
| | max_memory_reserved | 26.006 | 9.65 |

cc @neph1 Please give it a spin when you get time!

@a-r-r-o-w changed the title from "Lower memory requirements" to "Lower memory requirements on single GPU" on Mar 13, 2025
@a-r-r-o-w (Member, Author) commented:

@neph1 LMK if you've had a chance to test this. I can proceed to merge if it looks good.

@neph1 (Contributor) commented on Mar 16, 2025

Sorry for the delay. I seem to have messed up my branch when making the json dataset branch; it seems I don't have the optimized ptd at the moment.
In any case, the comments I made in the v0.1.0 PR were after a run on that branch, so it's functional.
During training I didn't notice any difference in VRAM requirements, but I guess that was expected since I didn't run with validation; I haven't set that up properly so far.
I'll do that for my next dataset and report back, but it might take some days.
I say merge this and keep working on the other things for the next stable release.

@a-r-r-o-w merged commit 9a479b1 into main on Mar 16, 2025
1 check passed
@a-r-r-o-w deleted the feature/lower-memory-requirements branch on March 16, 2025 at 20:59
Development

Successfully merging this pull request may close these issues:

  • Can't launch with accelerate on single gpu
2 participants