Add `pytorch_cuda_alloc_conf` config to tune VRAM memory allocation #7673

Conversation
As confirmation, I presume this does not play nicely on AMD?

I haven't tested on AMD, but I would not expect the recommended config of `backend:cudaMallocAsync` to work there.
- … config field that allows full customization of the CUDA allocator.
- …mported() to only run if CUDA is available (sketched below).
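The second commit's CUDA-availability guard can be pictured as follows. This is a minimal sketch, not the PR's actual code: it assumes the config value is applied by exporting PyTorch's `PYTORCH_CUDA_ALLOC_CONF` environment variable, and the function name is hypothetical.

```python
import os


def apply_cuda_alloc_conf(pytorch_cuda_alloc_conf: str | None) -> None:
    """Hypothetical helper: forward the config to PyTorch's allocator.

    PyTorch reads PYTORCH_CUDA_ALLOC_CONF when its CUDA allocator is
    initialized, so this must run before the first CUDA allocation.
    """
    if pytorch_cuda_alloc_conf is None:
        # Default (None): leave PyTorch's allocator behavior unchanged.
        return

    import torch

    # Only run if CUDA is available; the setting has no meaning on
    # CPU-only installs.
    if not torch.cuda.is_available():
        return

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = pytorch_cuda_alloc_conf
```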
Summary

This PR adds a `pytorch_cuda_alloc_conf` config flag to control the torch memory allocator behavior.

- `pytorch_cuda_alloc_conf` defaults to `None`, preserving the current behavior.
- Setting `pytorch_cuda_alloc_conf: "backend:cudaMallocAsync"` in `invokeai.yaml` (shown below) is expected to work well on many systems. This is a good first step for those looking to tune this config. (We may make this the default in the future.)
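For reference, here is how the recommended setting would appear in `invokeai.yaml`; the setting name and value are taken from this PR, while the comments are mine:

```yaml
# invokeai.yaml (fragment) - tune the torch CUDA memory allocator.
# Remove this line (or leave it unset) to keep the default behavior.
pytorch_cuda_alloc_conf: "backend:cudaMallocAsync"
```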
Memory Tests

Related Issues / Discussions

N/A
QA Instructions

Tested with:

- `pytorch_cuda_alloc_conf` unset.
- `pytorch_cuda_alloc_conf: "backend:cudaMallocAsync"` (a spot-check for the active backend is sketched below).
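Not part of the PR, but a quick way to confirm which allocator backend is active, using PyTorch's public `torch.cuda.get_allocator_backend()`:

```python
import torch

if torch.cuda.is_available():
    # Returns "native" for the default caching allocator, or
    # "cudaMallocAsync" when PYTORCH_CUDA_ALLOC_CONF selects that backend.
    print(torch.cuda.get_allocator_backend())
```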
Merge Plan

`main`
Checklist

- What's New copy (if doing a release after this PR)