Description
When using accelerate with DeepSpeed ZeRO-1 sharding in a single-node distributed setting, the GPU VRAM usage of two of the GPUs increases steadily over time. The growth appears to happen at each validation step, but only ever on two GPUs: whether I use 4 or 8 GPUs, it is always exactly two whose memory keeps climbing, eventually leading to an OOM error if left alone.
I've tried setting empty_cache_steps equal to eval_steps, and that alleviates the problem but doesn't solve it, as shown in the screenshots attached below.
I recognise this might be an issue with my setup and/or a bug in DeepSpeed rather than accelerate, but I would still be interested to hear others' opinions on this.
[MLFlow plot of GPU memory usage with 8 GPUs and cache emptying]

[Plot of GPU memory usage with 4 GPUs and no cache emptying]
*All other hyperparameters were kept the same for both runs shown above.
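For anyone trying to narrow this down, below is a minimal, hypothetical diagnostic sketch (not part of my actual training code) that logs per-rank VRAM around each validation step and mirrors the empty_cache_steps mitigation; `evaluate()` is just a placeholder for whatever eval routine the trainer runs.

```python
import torch
import torch.distributed as dist

def log_gpu_memory(tag: str) -> None:
    # Report per-rank allocated/reserved VRAM so the two growing ranks can be identified.
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[rank {rank}] {tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Around each validation step in the training loop:
#   log_gpu_memory("before eval")
#   evaluate()                      # placeholder for the actual eval routine
#   log_gpu_memory("after eval")
#   torch.cuda.empty_cache()        # same effect as setting empty_cache_steps == eval_steps
```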
System Info
Python 3.10.16
PyTorch 2.6.0
accelerate 1.7.0
CUDA 12.6
Driver Version: 560.35.05
accelerate config
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /opt/workspace/training_pkg/recipes/deepspeed_configs/zero1.json
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
deepspeed config
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "reduce_scatter": true,
    "allgather_bucket_size": 200000000.0,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "steps_per_print": 100,
  "wall_clock_breakdown": false,
  "tensor_parallel": {
    "autotp_size": 2
  }
}
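For context, a minimal sketch of how this JSON could be wired up programmatically, assuming accelerate's DeepSpeedPlugin and the same zero1.json path as in the accelerate config above; my actual runs go through `accelerate launch` with the YAML file rather than this wiring.

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Hypothetical programmatic equivalent of the accelerate config above.
ds_plugin = DeepSpeedPlugin(
    hf_ds_config="/opt/workspace/training_pkg/recipes/deepspeed_configs/zero1.json"
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)

# The "auto" batch-size / gradient-accumulation entries in the JSON are resolved by
# accelerate when the model, optimizer and dataloaders go through accelerator.prepare(...).
```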