Description
When using accelerate with DeepSpeed ZeRO-1 sharding in a single-node distributed setting, the GPU VRAM usage of two of the GPUs increases steadily over time. The growth appears to happen at each validation step, but only ever on two GPUs: whether I use 4 or 8 GPUs, it is always exactly two whose memory keeps climbing, eventually leading to an OOM error if left alone.
I've tried setting empty_cache_steps equal to eval_steps, and that alleviates the problem but doesn't solve it, as shown in the screenshots attached below.
I recognise this might be an issue with my setup and/or a bug in DeepSpeed rather than accelerate, but I would still be interested to hear others' opinions on this.
[MLFlow plot of GPU memory usage with 8 GPUs and cache emptying]

[Plot of GPU memory usage with 4 GPUs and no cache emptying]
*All other hyperparameters were kept the same for both runs shown above.
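For anyone trying to narrow this down, below is a minimal, hypothetical diagnostic sketch (not part of my actual training code) that logs per-rank VRAM around each validation step and mirrors the empty_cache_steps mitigation; `evaluate()` is just a placeholder for whatever eval routine the trainer runs.

```python
import torch
import torch.distributed as dist

def log_gpu_memory(tag: str) -> None:
    # Report per-rank allocated/reserved VRAM so the two growing ranks can be identified.
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[rank {rank}] {tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Around each validation step in the training loop:
#   log_gpu_memory("before eval")
#   evaluate()                      # placeholder for the actual eval routine
#   log_gpu_memory("after eval")
#   torch.cuda.empty_cache()        # same effect as setting empty_cache_steps == eval_steps
```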
System Info
Python 3.10.16
PyTorch 2.6.0
accelerate 1.7.0
CUDA 12.6
Driver Version: 560.35.05
accelerate config
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /opt/workspace/training_pkg/recipes/deepspeed_configs/zero1.json
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
deepspeed config
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "reduce_scatter": true,
    "allgather_bucket_size": 200000000.0,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "steps_per_print": 100,
  "wall_clock_breakdown": false,
  "tensor_parallel": {
    "autotp_size": 2
  }
}
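For context, a minimal sketch of how this JSON could be wired up programmatically, assuming accelerate's DeepSpeedPlugin and the same zero1.json path as in the accelerate config above; my actual runs go through `accelerate launch` with the YAML file rather than this wiring.

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Hypothetical programmatic equivalent of the accelerate config above.
ds_plugin = DeepSpeedPlugin(
    hf_ds_config="/opt/workspace/training_pkg/recipes/deepspeed_configs/zero1.json"
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)

# The "auto" batch-size / gradient-accumulation entries in the JSON are resolved by
# accelerate when the model, optimizer and dataloaders go through accelerator.prepare(...).
```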