
[DeepSpeed] GPU VRAM usage increases each validation step #3690

@RJExtrac

Description


When using Accelerate with DeepSpeed ZeRO-1 sharding in a single-node distributed setting, the GPU VRAM usage of two of the GPUs increases continually over time. The growth appears to happen at each validation step, but only on two GPUs: whether I run with 4 or 8 GPUs, it is always exactly two of them whose memory keeps climbing, eventually reaching the point where an OOM error occurs if the run is left alone.

I've tried setting empty_cache_steps to the same value as eval_steps, which alleviates the problem but doesn't solve it, as shown in the screenshots attached below.
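For clarity, here is a minimal sketch of that workaround (not my actual training code; `run_validation`, `model`, `eval_dataloader`, and the step cadence are illustrative placeholders): after each validation pass, the CUDA caching allocator is flushed explicitly, which is essentially what aligning empty_cache_steps with eval_steps does.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # DeepSpeed plugin is picked up from `accelerate config`

@torch.no_grad()
def run_validation(model, eval_dataloader):
    # Forward passes only; outputs are discarded in this sketch.
    model.eval()
    for batch in eval_dataloader:
        _ = model(**batch)
    model.train()

# Inside the training loop, at the same cadence as eval_steps:
#     run_validation(model, eval_dataloader)
#     torch.cuda.empty_cache()  # release cached, unreferenced blocks back to the driver
```

Since `torch.cuda.empty_cache()` only releases blocks that PyTorch has cached but is no longer using, it cannot free anything that is still being referenced, which is consistent with it alleviating but not solving the problem.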

I recognise this might be an issue with my setup and/or a bug in DeepSpeed rather than Accelerate, but I would still be interested to hear others' opinions on this.

MLFlow plot of GPU memory usage with 8 GPUs and cache emptying:

[screenshot]

Plot of GPU memory usage with 4 GPUs and no cache emptying:

[screenshot]

*All other hyperparameters were kept the same for both runs shown above.
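For reference, the same growth can be observed without MLFlow by logging CUDA memory per rank right after each validation pass; on my setup, only two ranks keep climbing. A minimal sketch (illustrative names, not my actual code; `global_step` is a placeholder):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

def log_cuda_memory(step):
    device = accelerator.device
    allocated = torch.cuda.memory_allocated(device) / 2**30  # GiB currently in use by tensors
    reserved = torch.cuda.memory_reserved(device) / 2**30    # GiB held by the caching allocator
    print(f"step {step} | rank {accelerator.process_index} | "
          f"allocated={allocated:.2f} GiB | reserved={reserved:.2f} GiB")

# Called right after each validation pass, e.g. log_cuda_memory(global_step)
```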

System Info

python 3.10.16
pytorch 2.6.0
accelerate 1.7.0
cuda 12.6
Driver Version: 560.35.05

accelerate config

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /opt/workspace/training_pkg/recipes/deepspeed_configs/zero1.json
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

deepspeed config

{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "reduce_scatter": true,
    "allgather_bucket_size": 200000000.0,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "steps_per_print": 100,
  "wall_clock_breakdown": false,
  "tensor_parallel": {
    "autotp_size": 2
  }
}
