RTX 4090 performance #2449

Open · Bachine opened this issue Oct 13, 2022 · 457 comments
Labels: asking-for-help-with-local-system-issues (This issue is asking for help related to a local system; please offer assistance)

Comments

@Bachine commented Oct 13, 2022

Is the 4090 fully supported in SD?

I am getting the same performance with my 4090 as I was getting with my 3070.

@Bachine added the bug-report label Oct 13, 2022
@ClashSAN added the question label and removed the bug-report label Oct 13, 2022
@Thomas-MMJ commented Oct 13, 2022

What parameters are you using? You might want to try, say, a batch of 5+ images. Also, is xformers installed and are you launching with --xformers? Are you using --opt-channelslast? Half precision? Etc.
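
For reference, launch flags like these go in COMMANDLINE_ARGS in webui-user.bat. A minimal sketch, assuming a stock Windows install; which flags actually help depends on your setup:

@echo off
rem webui-user.bat -- illustrative only; enable the flags discussed above as needed
set COMMANDLINE_ARGS=--xformers --opt-channelslast
call webui.bat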

@Kosinkadink

I'm having the same issue: my 4090 is generating at around the same speed as (or slower than) my 3090 did on the same machine. I am using Windows 10, with --xformers as my only parameter. I will be setting up my 3090 in a different PC soon, so I will be able to provide a direct it/s comparison.

My "benchmark" is using just the prompt "chair" at 150 steps with all default settings (Euler a, 512x512, Scale 7, Clip Skip 1, ENSD 0).

Using --xformers doesn't seem to make a difference, either way I'm getting around 10.6 it/s.

@C43H66N12O12S2 (Collaborator) commented Oct 13, 2022

Xformers currently lacks support for Lovelace (in fact, PyTorch also lacks it, I believe).

Your quoted 3090 numbers are too low BTW. I get around 16it/s with your settings on a 3080 12GB.

I'll perform some further testing when my 4090 arrives. (and attempt to build xformers for it)

@Kosinkadink commented Oct 13, 2022

Gotcha, I guess my 4090 performance will be meh until PyTorch (and xformers) get Lovelace support.

Those numbers were for my 4090; my 3090 was not plugged in at the time, but it is now.

With those same settings, my 3090 gets around 15.7 it/s without --xformers. I don't have xformers set up yet on that machine (I'm running Ubuntu and will need to use a workaround to get xformers installed properly).

So the 4090 currently delivers only about two-thirds the performance of a 3090 without xformers.

@cmp-nct commented Oct 14, 2022

I had been hitting 15-18 it/s with my 3090, but now it's 13.
No command line parameters, still the same setup.
Strange.

@C43H66N12O12S2 (Collaborator)

Preliminary testing:
The 4090 (or JIT PTX) really dislikes channels last; it halves performance.
Without channels last, my 4090 is about 10 times slower than my 3080 with torch 1.12.1 + cu116.

@C43H66N12O12S2 (Collaborator) commented Oct 15, 2022

Updating cuDNN did the trick. Getting 15 it/s without xformers.

To support Lovelace in xformers, we need a CUDA 11.8 build of PyTorch (I think).
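
For anyone wanting to try the same cuDNN swap: the approach that comes up later in this thread is to overwrite the cuDNN DLLs that ship inside torch's lib folder in the venv. A hedged sketch for Windows; the download location is a placeholder, and you should back up the originals first:

rem Illustrative only -- download a newer cuDNN build for CUDA 11.x from NVIDIA first
copy /Y C:\Downloads\cudnn\bin\cudnn*.dll stable-diffusion-webui\venv\Lib\site-packages\torch\lib\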

@C43H66N12O12S2 (Collaborator) commented Oct 15, 2022

Nope, managed to build it. Getting a 43% speed-up compared to my 3080: 23 it/s.

With batch size 8, my 4090 is twice as fast compared to the 3080: ~40 it/s.

@C43H66N12O12S2 (Collaborator) commented Oct 15, 2022

@ilcane87 @cmp-nct @Kosinkadink Could you please test this wheel? This works on my 4090, but I need to make sure there isn't a regression (broken on Ampere or Pascal or whatever) with --force-enable-xformers:
https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/d/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

@ilcane87

@C43H66N12O12S2
Doesn't work for me on GeForce GTX 1060 6GB:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

@SSCBryce

@C43H66N12O12S2 I'd love to test this on my 4090 :D. Do I just activate the venv and then pip install it while it sits in the webui root dir?

@C43H66N12O12S2 (Collaborator)

Yep
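
A hedged sketch of what that looks like on Windows, with the wheel linked above sitting in the webui root (paths are illustrative; as comes up just below, pip may also need --force-reinstall because the version string matches the already-installed package):

cd stable-diffusion-webui
venv\Scripts\activate
pip install xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl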

@SSCBryce

@C43H66N12O12S2 Oh, it also says it's the same version; should've seen that coming... I'm assuming I just add --force-reinstall?

@SSCBryce

Sorry if I'm not supposed to be pinging you; I don't really ever develop software... I seem to have borked it. It's the same issue I was having when I was trying to build this stuff myself. This is after it failed to load, and I also added --skip-torch-cuda-test to make it load at least.
[screenshot of the error]

@C43H66N12O12S2 (Collaborator)

pip seems to have replaced your torch with a CPU-only one. Do pip uninstall torch and start the repo again.
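
In practice that recovery looks something like the following; a sketch, assuming the default venv layout (the launcher re-downloads the CUDA build of torch on the next start):

venv\Scripts\activate
pip uninstall -y torch
deactivate
webui-user.bat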

@boyjunqiang

On Ubuntu with xformers, my 3090 can get 20 it/s. It looks like the 4090 doesn't improve much; maybe we still need to wait for a driver update.

@ilcane87

@ilcane87 Please try this one. Much thanks for testing these for me :)
https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/e/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

Still getting the same error on this one too.

@SSCBryce

@C43H66N12O12S2 Is there any way to ensure it's using the new wheel? Getting about 11 it/s at 512x512, CFG 8, 40 steps, Euler a on NAI (not sure which of these affect performance, so I'm noting most settings).

@C43H66N12O12S2 (Collaborator)

Well, it would error with an older wheel. You need to add --force-enable-xformers to your COMMANDLINE_ARGS, btw (for now).

@SSCBryce

Oh, interesting. It is in there, though, along with --no-half and --precision full for NAI. Not sure if the model is making the difference.

@C43H66N12O12S2 (Collaborator)

Can you test your performance without xformers? I get 23 it/s, and --no-half should be about half that speed, so 11 it/s sounds about right, actually.

@SSCBryce

Weird. On first load without forced xformers, it never got past Commit hash:blabla. Closing the cmd window and starting it again, though, produced a regular load time. Without xformers, same generation settings and command-line args, I'm getting about 9.6 it/s. No batching.

@C43H66N12O12S2 (Collaborator)

Yeah, xformers is working for you. Use larger batches for bigger gains. Also consider removing --no-half and --precision full; IME FP16 and FP32 have only minute differences, but FP16 is twice the speed.

@SSCBryce

Was literally about to ask if I really need those, yeah, thanks for pointing it out! Wish more coders were as clearly spoken and thoughtful as you are, man. Two random questions while I have your attention :D
Should this help training, too? That's what I'm currently into, and was seeing similar numbers. Should I batch that as well?
And then, is it better to increase batch count or size? Same either way? I always see a "decrease" in it/s when I batch but maybe that's just the UI giving weird numbers.

@C43H66N12O12S2 (Collaborator) commented Oct 15, 2022

Yes.
Increasing size is equivalent to increasing count, but increasing size will also increase your speed on these large GPUs (practically anything faster than a 1070), where VRAM bandwidth is a large bottleneck.

You're seeing a decrease because it's generating multiple images at once. To calculate what it'd correspond to for a single image, multiply iterations per second by batch size.
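
As a quick back-of-the-envelope check, a toy Python sketch (the numbers are illustrative, not measurements from this thread):

# The progress bar reports iterations over the whole batch, not per image.
displayed_its = 5.0   # it/s shown by the UI for a batch (illustrative)
batch_size = 8
effective_its = displayed_its * batch_size  # per-image equivalent throughput
print(f"{effective_its:.1f} single-image it/s")  # -> 40.0 single-image it/s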

@sreinwald (Contributor)

If you are getting suboptimal performance on Windows even after upgrading PyTorch and replacing the cuDNN .dlls, check your Windows settings to see if "Hardware-accelerated GPU scheduling" is turned on.
For me (Windows 11 Pro) it was enabled by default, and disabling that setting boosted my it/s by roughly 30%.

On Windows 11 you can find that setting in System > Display > Graphics > Default graphics settings.

@leemmcc commented Apr 2, 2023

At this point, does anyone just have a reproducible "git clone, do x, do y, get great performance" recipe for the 4090? There are so many conflicting instructions and partial steps in this thread by now.

@BasedAnon

I'm getting <2 it/s with my 4090...
Did they incorporate Lovelace support?

@SamueleLorefice

I'm getting <2 it/s with my 4090... Did they incorporate Lovelace support?

@BasedAnon less than 2 it/s? Using what settings, exactly?

At this point, does anyone just have a reproducible "git clone, do x, do y, get great performance" recipe for the 4090? There are so many conflicting instructions and partial steps in this thread by now.

@leemmcc Install Linux, replace the cuDNN libs, add --xformers --opt-sub-quad-attention --opt-channelslast, and that's pretty much it (see the sketch below). On Windows it's basically the same, but you also want to disable HAGS (in System > Display > Graphics > Default Graphics Settings) and put your PC into high-performance mode (Process Lasso can help with that, and also with making sure it has priority over everything), and that should be all. You will still have a bit of a bottleneck anyway.

And that's as much as I know atm.
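
On Linux those flags go in webui-user.sh; a minimal sketch of the relevant line, assuming a stock install:

# webui-user.sh -- illustrative; the flags suggested above
export COMMANDLINE_ARGS="--xformers --opt-sub-quad-attention --opt-channelslast"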

@BasedAnon commented Apr 7, 2023

@SamueleLorefice 512 x 904, Heun, Auto1111, everything else stock (Anything V3 model with VAE).
I think it's the high resolution; I didn't realize how much it slows things down.
I'm about to try ComfyUI instead.
Edit: I tried it again at 512x512 and am now getting around 9 it/s.

@BasedAnon commented Apr 9, 2023

I updated to PyTorch 2.0 and wow, I'm getting 16+ it/s at 512x512.
The speeds seem to be about the same at larger resolutions, though.

@XeonG commented Apr 9, 2023

Can someone please do a guide for Windows covering updating to PyTorch 2.0, which cuDNN libs to use, etc.?

@VictorZakharov

@XeonG It would be better if someone introduced a config flag in A1111 to use torch 2.0.
BTW, I tried using 2.0 and it made no difference. Maybe I messed up somewhere.

@morty6688 commented Apr 9, 2023

#8696

From this post, we can use pip3 install clean-fid numba numpy torch==2.0.0+cu118 torchvision --force-reinstall --extra-index-url https://download.pytorch.org/whl/cu118 to update. I'm using a 4080 laptop; before updating, I got 4-5 it/s. After updating, I get 12-13 it/s generating a chair.

BTW, disabling HAGS seems to make no difference for me on Win11 22621.
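
After an upgrade like that, a quick way to confirm the intended build actually landed (a hedged one-liner; the exact version strings will vary):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"
# expected output shape: 2.0.0+cu118 11.8 8700 (illustrative)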

@VictorZakharov

@lijian12345 No effect in my case.
4070 Ti. 7 it/sec before and after.
512, 40 steps, fp16, no xformers.

@rodsott commented Apr 10, 2023

Lots of testing here today, trying to make my 4090 reach 100% GPU usage like it does in the System Info benchmark, where I have 40 it/s recorded. But using the regular latest "git pull" update, with torch 1.13.1+cu117 and xformers 0.0.16rc425, I get a max speed of 19 it/s at a capped 50% GPU usage. I tried both v11.8 and v12.1 of the CUDA Runtime, and the respective cuDNN files in the .\venv\Lib\site-packages\torch\lib folder.

These tests were done in txt2img at 512x512, 20 steps, batch size 10, with an empty prompt and the "v1-5-pruned-emaonly" checkpoint.

I tested with both the latest NVIDIA Studio and Game Ready drivers, v531.41.

I did a fresh install and updated it to torch 2.0.0+cu118 (which messed with torchvision, giving an error but still working), disabled xformers, and added "--opt-sdp-attention" to the launcher. With that I've reached 25 it/s at 60% GPU usage. Best scenario so far, but wasn't it supposed to behave the same as the benchmark across all the 1111 functionality, like img2img, txt2img, Deforum, etc.?

Oh, I also added the following lines at the beginning of .\modules\txt2img.py, but nothing changed either:

import torch
torch.backends.cudnn.benchmark = True  # let cuDNN auto-tune conv algorithms for fixed input sizes
torch.backends.cudnn.enabled = True  # ensure cuDNN is used at all (True by default)
torch.backends.cuda.matmul.allow_tf32 = True  # allow TF32 matmuls on Ampere and newer

Here are the results of the benchmark; using the Extra Steps and Extensive options, my 4090 reached 40 it/s:

[benchmark screenshot]

If anyone knows how to make auto1111 work at 100% CUDA usage, especially on the RTX 4090, please share a workaround here! Thanks in advance! =)

@morty6688 commented Apr 10, 2023

@lijian12345 No effect in my case. 4070 Ti. 7 it/sec before and after. 512, 40 steps, fp16, no xformers.

Without xformers I get 8-9 it/s; maybe you can check whether your GPU driver and torch version are updated properly.
Adding --xformers --opt-sub-quad-attention --opt-channelslast can noticeably boost the iteration speed.

@haldi4803 commented Apr 10, 2023

No xformers, cuDNN 8500
xformers, cuDNN 8500
xformers, cuDNN 8800
Torch 2.0 with --opt-sub-quad-attention --opt-channelslast added
[screenshot comparing the results of each configuration]

@GuruVirus

For Win 11 I got it working and wrote up this guide: https://rentry.org/31337
For Linux instructions see: https://rentry.org/installing-automatic1111

@VictorZakharov

@GuruVirus Followed the guide and got 20.69 it/sec in the 1200 test on my 4070 Ti.
Same numbers with batch size 1, so batch 8 test is not necessary.
Thanks!

@haldi4803 commented Apr 20, 2023

For Win 11 I got it working and wrote up this guide: https://rentry.org/31337 For Linux instructions see: https://rentry.org/installing-automatic1111

pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
(can skip if using --opt-sdp-no-mem-attention instead of --xformers)
If you get file path length error exit venv and run
git config --system core.longpaths true
If you get a generic compile failure, especially related to cmake, make sure you install C++ build tools: https://visualstudio.microsoft.com/visual-cpp-build-tools/

Doesn't work. Are there any special settings needed when installing the C++ Build Tools?

https://pastebin.com/SGY7eQxU

[screenshot of the failed build]

@VictorZakharov

@haldi4803 Use SDP instead. I had issues with xformers.

@GuruVirus commented Apr 21, 2023

I updated the guide to make it easier to skip xformers, which I recommend for simplicity.
If you want to continue with xformers: I noticed in your screenshot, in the right column, that the Win11 SDK is unchecked; it is checked by default.
[screenshot of the Visual Studio installer components]

@VictorZakharov

Yet another fix is to switch to using this repo:

aka Vlad1111. I had no issues getting 21 it/s on a 4070 Ti (512x512, 100 steps, Euler a) without xformers (but with SDP). It is also faster to start rendering: A1111 takes a few seconds, while Vlad1111 starts instantly, which can be a 50% improvement on low-step 512x512 gens, assuming high-end video cards.

No additional configs, no tweaking, it just worked. It failed on the first run of webui.bat, but the second time it worked with no issues. Inpainting supported, upscale supported, pretty much all the standard features that in A1111 might need additional setup. Finally some competition.


@keetruda69

I tried https://rentry.org/installing-automatic1111 and jumped from 4 it/s to 20 it/s, but wasn't able to get further. Then I installed https://github.com/vladmandic/automatic and it did the magic: I got right up to 40 it/s without any tweaks. I haven't had time to check why, but it works like a charm.

@VictorZakharov

@keetruda69 This is what happens when devs ignore quality-of-life updates for too long and only focus on features or low-hanging fruit.

@aifartist

I tried https://rentry.org/installing-automatic1111 and jumped from 4 it/s to 20 it/s, but wasn't able to get further. Then I installed https://github.com/vladmandic/automatic and it did the magic: I got right up to 40 it/s without any tweaks. I haven't had time to check why, but it works like a charm.

I recall seeing that it was installing PyTorch 2.0.1+cu117 instead of 2.0.1+cu118, which is where the fix is.
Vlado's fork gets this right.

@XeonG commented Jun 18, 2023

Vlad's fork looks awful (dunno why they don't stick to the original theme)... but that's not the issue. Every time I've tried reinstalling it fresh, it seems really slow at loading, with navigation barely working... I dunno, it just seems very broken, or I just keep trying it at the worst commit times... slow JS. Automatic seems to work, but performance with a 4090 is just plain garbage. The Python stuff just seems like a mess... why would anyone use such a broken, janky language for this stuff?

@VictorZakharov

@XeonG Open a ticket; he's pretty quick to resolve all real issues.

@XeonG commented Jun 19, 2023

Well, I would... but every time I've tried it, even fresh with a cleared browser cache and the same setup (models, VAE, etc.) as automatic, Vlad's one just seems to have trouble loading, with a generally sluggish UI. The last time it didn't even work at all when loading the model checkpoint... but there were no real errors to report, so I assumed I just tried it at a bad commit point, no idea. Maybe I will file a bug issue next time I try it.

@catboxanon added the asking-for-help-with-local-system-issues label and removed the question label Aug 3, 2023
Atry pushed a commit to Atry/stable-diffusion-webui that referenced this issue Jul 11, 2024