Offload
Offload is a method of moving a model, or parts of a model, between GPU memory (VRAM) and system memory (RAM) in order to reduce the model's memory footprint and allow it to run on GPUs with less VRAM.
Tip
Offload mode is set in Settings -> Models & Loading -> Model offload mode
Balanced offload works differently than all other offloading methods as it performs offloading only when the VRAM usage exceeds the user-specified threshold.
- Recommended for compatible high VRAM GPUs
- Faster but requires compatible platform and sufficient VRAM
- Balanced offload moves parts of the model depending on user-specified thresholds, allowing you to control how much VRAM is used
- The high threshold sets the maximum memory allowed for the model weights of a single model component
- The low threshold decides when unused models are offloaded back to RAM: if VRAM usage is above the low threshold, they are offloaded; otherwise nothing happens
- Default high memory threshold is 70% of the available GPU memory
- Default low memory threshold is 20% of the available GPU memory for GPUs with more than 12GB of VRAM, otherwise the default is 0%
- Configure threshold in Settings -> Models & Loading -> Balanced offload GPU high / low watermark
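The watermark behavior described above can be sketched as a simple decision function. This is an illustrative sketch only; the function and variable names are hypothetical, not SD.Next's actual code.

```python
# Hypothetical sketch of the balanced-offload watermark logic;
# names are illustrative, not the actual implementation.

def balanced_offload_action(vram_used_gb: float, vram_total_gb: float,
                            high_watermark: float = 0.70,
                            low_watermark: float = 0.20) -> str:
    """Return which action balanced offload would take for the current VRAM usage.

    high_watermark: max fraction of VRAM a single component's weights may occupy
    low_watermark: fraction above which unused models are moved back to RAM
    """
    usage = vram_used_gb / vram_total_gb
    if usage > high_watermark:
        # above the high watermark: the active component exceeds its budget
        return "offload-active-component"
    if usage > low_watermark:
        # between the watermarks: unused models are offloaded back to RAM
        return "offload-unused-models"
    # below the low watermark: do nothing
    return "none"

# Example: 10 GB used on a 16 GB GPU is 62.5%, which is between the
# default 20% low and 70% high watermarks.
print(balanced_offload_action(10, 16))  # -> offload-unused-models
```

Because offloading happens only when a watermark is crossed, balanced offload does no work at all while usage stays below the low threshold, which is why it is faster than the other modes.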
Warning
Not compatible with Optimum.Quanto qint quantization
Sequential offload works on a layer-by-layer basis within each model component that is marked as offload-compatible
- Recommended for low VRAM GPUs
- Much slower, but allows running large models such as FLUX even on GPUs with 2-4 GB of VRAM
Warning
Not compatible with Quanto qint or BitsAndBytes nf4 quantization
Note
Use of --lowvram automatically enables sequential offload
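The layer-by-layer behavior can be sketched as follows; device moves are simulated with a string attribute so the logic runs without a GPU, and all names are illustrative rather than the actual implementation.

```python
# Minimal sketch of sequential (layer-by-layer) offload.
# Device placement is simulated; names are hypothetical.

class Layer:
    def __init__(self, name: str):
        self.name = name
        self.device = "cpu"  # weights start in system RAM

    def forward(self, x):
        assert self.device == "cuda", "layer must be on GPU to compute"
        return x + 1  # stand-in for the real computation

def sequential_forward(layers, x):
    """Move one layer at a time to the GPU, run it, then move it back.

    Only a single layer's weights occupy VRAM at any moment, which is why
    sequential offload fits large models into 2-4 GB of VRAM but is much
    slower: every layer pays two host<->device transfers per pass.
    """
    for layer in layers:
        layer.device = "cuda"   # upload this layer's weights
        x = layer.forward(x)
        layer.device = "cpu"    # free VRAM before the next layer
    return x

layers = [Layer(f"block{i}") for i in range(4)]
print(sequential_forward(layers, 0))  # -> 4
```

After the pass, every layer is back on the CPU, so peak VRAM usage is roughly one layer's weights plus activations.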
Model offload works at the model-component level by offloading components that are marked as offload-compatible, for example the VAE, text encoder, etc.
- Recommended for medium VRAM GPUs and when balanced offload is not compatible
- Higher compatibility than either balanced or sequential offload, but smaller memory savings
Limitations: N/A
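Component-level offload can be sketched the same way: whole components are moved to the GPU only for the pipeline stage that uses them. Device placement is simulated here, and the component and function names are hypothetical.

```python
# Illustrative sketch of model (component-level) offload.
# Device placement is simulated; names are hypothetical.

class Component:
    def __init__(self, name: str, offload_compatible: bool = True):
        self.name = name
        self.offload_compatible = offload_compatible
        self.device = "cpu"

def run_component(component, previous):
    """Bring a component to the GPU for its stage, offloading the previous one."""
    if previous is not None and previous.offload_compatible:
        previous.device = "cpu"    # finished stage goes back to RAM
    component.device = "cuda"      # current stage occupies VRAM
    return component

# Typical diffusion pipeline order: text encoder -> UNet -> VAE
pipeline = [Component("text_encoder"), Component("unet"), Component("vae")]
previous = None
for comp in pipeline:
    previous = run_component(comp, previous)

# Only the most recently used component remains in VRAM
print([c.device for c in pipeline])  # -> ['cpu', 'cpu', 'cuda']
```

Because entire components move at once, there are far fewer transfers than with sequential offload, which is why this mode is faster but saves less memory: VRAM must still hold the largest single component.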
- Tested using SDXL with 2 large LoRA models
- Sequential offload is default for GPUs with 4GB or less
- Balanced offload is default for GPUs with more than 4GB
Balanced offload is slower than no offload, but allows running large models such as SD35 and FLUX.1 out-of-the-box
- Balanced offload set to default values
- LoRA overhead is measured in sec for first and subsequent iterations
- LoRA mode=backup can use up to 2x system memory; using backup can be prohibitive on large models such as SD35 or FLUX.1
| Offload mode | LoRA type | LoRA mode | LoRA overhead (sec, first / subsequent) | End-to-end it/s | Note |
|---|---|---|---|---|---|
| none | none | N/A | N/A | 6.7 | fastest inference |
| balanced | none | N/A | N/A | 4.5 | default without LoRA |
| sequential | none | N/A | N/A | 0.6 | lowvram |
| none | native | backup | 1.8 / 0.0 | 6.0 | |
| balanced | native | backup | 1.3 / 0.0 | 2.8 | |
| sequential | native | backup | 5.8 / 0.0 | 0.5 | |
| none | native | fuse | 1.3 / 1.3 | 4.8 | |
| balanced | native | fuse | 2.8 / 2.5 | 3.1 | default with LoRA |
| sequential | native | fuse | 8.8 / 7.7 | 0.4 | |
| none | diffusers | default | 2.9 / 2.9 | 3.8 | |
| balanced | diffusers | default | 2.2 / 2.2 | 2.1 | |
| sequential | diffusers | default | 4.6 / 4.6 | 0.3 | |
| none | diffusers | fuse | 5.7 / 5.7 | 2.0 | |
| balanced | diffusers | fuse | N/A | did not complete | |
| sequential | diffusers | fuse | N/A | did not complete | |