RTX 4090 performance #2449

Open · Bachine opened this issue Oct 13, 2022 · 457 comments
Labels: asking-for-help-with-local-system-issues (This issue is asking for help related to a local system; please offer assistance)

Comments

@Bachine commented Oct 13, 2022

Is the 4090 fully supported in SD?

I am getting the same performance with my 4090 as I was getting with my 3070.

@Bachine added the bug-report label Oct 13, 2022
@ClashSAN added the question label and removed the bug-report label Oct 13, 2022
@Thomas-MMJ commented Oct 13, 2022

What parameters are you using? You might want to try, say, a batch of 5+ images. Also, is xformers installed and are you launching with --xformers? Are you using --opt-channelslast? Half precision? Etc.
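
For reference, launch flags like these go in COMMANDLINE_ARGS in webui-user.bat. A minimal sketch, assuming a stock Windows install; which flags actually help depends on your setup:

@echo off
rem webui-user.bat -- illustrative only; enable the flags discussed above as needed
set COMMANDLINE_ARGS=--xformers --opt-channelslast
call webui.bat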

@Kosinkadink

I'm having the same issue: my 4090 is generating at around the same speed as (or slower than) my 3090 did on the same machine. I am using Windows 10, with --xformers as my only parameter. I will be setting up my 3090 in a different PC soon, so I will be able to provide a direct it/s comparison.

My "benchmark" is using just the prompt "chair" at 150 steps with all default settings (Euler a, 512x512, Scale 7, Clip Skip 1, ENSD 0).

Using --xformers doesn't seem to make a difference, either way I'm getting around 10.6 it/s.

@C43H66N12O12S2 (Collaborator) commented Oct 13, 2022

Xformers currently lacks support for Lovelace (in fact, PyTorch also lacks it, I believe).

Your quoted 3090 numbers are too low BTW. I get around 16it/s with your settings on a 3080 12GB.

I'll perform some further testing when my 4090 arrives. (and attempt to build xformers for it)

@Kosinkadink commented Oct 13, 2022

Gotcha, I guess my 4090 performance will be meh until PyTorch (and xformers) get Lovelace support.

Those numbers were for my 4090; my 3090 was not plugged in at the time, but it is now.

With those same settings, my 3090 gets around 15.7 it/s without --xformers. I don't have xformers set up yet on that machine (I'm running Ubuntu and will need to use a workaround to get xformers installed properly).

So the 4090 currently delivers only about two-thirds the performance of a 3090 without xformers.

@cmp-nct commented Oct 14, 2022

I had been hitting 15-18 it/s with my 3090, but now it's 13.
No command line parameters, still the same setup.
Strange.

@C43H66N12O12S2 (Collaborator)

Preliminary testing:
The 4090 (or JIT PTX) really dislikes channels last; it halves performance.
Without channels last, my 4090 is about 10 times slower than my 3080 with torch 1.12.1 + cu116.

@C43H66N12O12S2 (Collaborator) commented Oct 15, 2022

Updating cuDNN did the trick. Getting 15 it/s without xformers.

To support Lovelace in xformers, we need a CUDA 11.8 build of PyTorch (I think).
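
For anyone wanting to try the same cuDNN swap: the approach that comes up later in this thread is to overwrite the cuDNN DLLs that ship inside torch's lib folder in the venv. A hedged sketch for Windows; the download location is a placeholder, and you should back up the originals first:

rem Illustrative only -- download a newer cuDNN build for CUDA 11.x from NVIDIA first
copy /Y C:\Downloads\cudnn\bin\cudnn*.dll stable-diffusion-webui\venv\Lib\site-packages\torch\lib\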

@C43H66N12O12S2 (Collaborator) commented Oct 15, 2022

Nope, managed to build it. Getting a 43% speed-up compared to my 3080: 23 it/s.

With batch size 8, my 4090 is twice as fast compared to the 3080: ~40 it/s.

@C43H66N12O12S2 (Collaborator) commented Oct 15, 2022

@ilcane87 @cmp-nct @Kosinkadink Could you please test this wheel? This works on my 4090, but I need to make sure there isn't a regression (broken on Ampere or Pascal or whatever) with --force-enable-xformers:
https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/d/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

@ilcane87

@C43H66N12O12S2
Doesn't work for me on GeForce GTX 1060 6GB:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

@SSCBryce

@C43H66N12O12S2 I'd love to test this on my 4090 :D. Do I just activate the venv and then pip install it while it sits in the webui root dir?

@C43H66N12O12S2 (Collaborator)

Yep
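
A hedged sketch of what that looks like on Windows, with the wheel linked above sitting in the webui root (paths are illustrative; as comes up just below, pip may also need --force-reinstall because the version string matches the already-installed package):

cd stable-diffusion-webui
venv\Scripts\activate
pip install xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl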

@SSCBryce

@C43H66N12O12S2 Oh, it also says it's the same version; should've seen that coming... I'm assuming I just add --force-reinstall?

@SSCBryce

Sorry if I'm not supposed to be pinging you; I don't really ever develop software... I seem to have borked it. It's the same issue I was having when I was trying to build this stuff myself. This is after it failed to load, and I also added --skip-torch-cuda-test to make it load at least.
[screenshot of the error]

@C43H66N12O12S2 (Collaborator)

pip seems to have replaced your torch with a CPU-only one. Do pip uninstall torch and start the repo again.
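
In practice that recovery looks something like the following; a sketch, assuming the default venv layout (the launcher re-downloads the CUDA build of torch on the next start):

venv\Scripts\activate
pip uninstall -y torch
deactivate
webui-user.bat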

@boyjunqiang

On Ubuntu with xformers, my 3090 can get 20 it/s. It looks like the 4090 doesn't improve much; maybe we still need to wait for a driver update.

@ilcane87

@ilcane87 Please try this one. Much thanks for testing these for me :)
https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/e/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

Still getting the same error on this one too.

@SSCBryce

@C43H66N12O12S2 Is there any way to ensure it's using the new wheel? Getting about 11 it/s at 512x512, CFG 8, 40 steps, Euler a on NAI (not sure which of these affect performance, so I'm noting most settings).

@C43H66N12O12S2 (Collaborator)

Well, it would error with an older wheel. You need to add --force-enable-xformers to your COMMANDLINE_ARGS, btw (for now).

@SSCBryce

Oh, interesting. It is in there, though, along with --no-half and --precision full for NAI. Not sure if the model is making the difference.

@C43H66N12O12S2 (Collaborator)

Can you test your performance without xformers? I get 23 it/s, and --no-half should be about half that speed, so 11 it/s sounds about right, actually.

@SSCBryce

Weird. On first load without forced xformers, it never got past Commit hash:blabla. Closing the cmd window and starting it again, though, produced a regular load time. Without xformers, same generation settings and command-line args, I'm getting about 9.6 it/s. No batching.

@C43H66N12O12S2 (Collaborator)

Yeah, xformers is working for you. Use larger batches for bigger gains. Also consider removing --no-half and --precision full; IME FP16 and FP32 have only minute differences, but FP16 is twice the speed.

@SSCBryce

Was literally about to ask if I really need those, yeah, thanks for pointing it out! Wish more coders were as clearly spoken and thoughtful as you are, man. Two random questions while I have your attention :D
Should this help training, too? That's what I'm currently into, and was seeing similar numbers. Should I batch that as well?
And then, is it better to increase batch count or size? Same either way? I always see a "decrease" in it/s when I batch but maybe that's just the UI giving weird numbers.

@C43H66N12O12S2 (Collaborator) commented Oct 15, 2022

Yes.
Increasing size is equivalent to increasing count, but increasing size will also increase your speed on these large GPUs (practically anything faster than a 1070), where VRAM bandwidth is a large bottleneck.

You're seeing a decrease because it's generating multiple images at once. To calculate what it'd correspond to for a single image, multiply iterations per second by batch size.
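
As a quick back-of-the-envelope check, a toy Python sketch (the numbers are illustrative, not measurements from this thread):

# The progress bar reports iterations over the whole batch, not per image.
displayed_its = 5.0   # it/s shown by the UI for a batch (illustrative)
batch_size = 8
effective_its = displayed_its * batch_size  # per-image equivalent throughput
print(f"{effective_its:.1f} single-image it/s")  # -> 40.0 single-image it/s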

@sreinwald (Contributor)

If you are getting suboptimal performance on Windows even after upgrading PyTorch and replacing the cuDNN .dlls, check your Windows settings to see if "Hardware-accelerated GPU scheduling" is turned on.
For me (Windows 11 Pro) it was enabled by default, and disabling that setting boosted my it/s by roughly 30%.

On Windows 11 you can find that setting in System > Display > Graphics > Default graphics settings.

@leemmcc commented Apr 2, 2023

At this point, does anyone just have a reproducible "git clone, do x, do y, get great performance" recipe for the 4090? There are so many conflicting instructions and partial steps in this thread by now.

@BasedAnon

I'm getting <2 it/s with my 4090...
Did they incorporate Lovelace support?

@SamueleLorefice

I'm getting <2 it/s with my 4090... Did they incorporate Lovelace support?

@BasedAnon less than 2 it/s? Using what settings, exactly?

At this point, does anyone just have a reproducible "git clone, do x, do y, get great performance" recipe for the 4090? There are so many conflicting instructions and partial steps in this thread by now.

@leemmcc Install Linux, replace the cuDNN libs, add --xformers --opt-sub-quad-attention --opt-channelslast, and that's pretty much it (see the sketch below). On Windows it's basically the same, but you also want to disable HAGS (in System > Display > Graphics > Default Graphics Settings) and put your PC into high-performance mode (Process Lasso can help with that, and also with making sure it has priority over everything), and that should be all. You will still have a bit of a bottleneck anyway.

And that's as much as I know atm.
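
On Linux those flags go in webui-user.sh; a minimal sketch of the relevant line, assuming a stock install:

# webui-user.sh -- illustrative; the flags suggested above
export COMMANDLINE_ARGS="--xformers --opt-sub-quad-attention --opt-channelslast"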

@BasedAnon commented Apr 7, 2023

@SamueleLorefice 512 x 904, Heun, Auto1111, everything else stock (Anything V3 model with VAE).
I think it's the high resolution; I didn't realize how much it slows things down.
I'm about to try ComfyUI instead.
Edit: I tried it again at 512x512 and am now getting around 9 it/s.

@BasedAnon commented Apr 9, 2023

I updated to PyTorch 2.0 and wow, I'm getting 16+ it/s at 512x512.
The speeds seem to be about the same at larger resolutions, though.

@XeonG commented Apr 9, 2023

Can someone please do a guide for Windows covering updating to PyTorch 2.0, which cuDNN libs to use, etc.?

@VictorZakharov

@XeonG It would be better if someone introduced a config flag in A1111 to use torch 2.0.
BTW, I tried using 2.0 and it made no difference. Maybe I messed up somewhere.

@morty6688 commented Apr 9, 2023

#8696

From this post, we can use pip3 install clean-fid numba numpy torch==2.0.0+cu118 torchvision --force-reinstall --extra-index-url https://download.pytorch.org/whl/cu118 to update. I'm using a 4080 laptop; before updating, I got 4-5 it/s. After updating, I get 12-13 it/s generating a chair.

BTW, disabling HAGS seems to make no difference for me on Win11 22621.
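
After an upgrade like that, a quick way to confirm the intended build actually landed (a hedged one-liner; the exact version strings will vary):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"
# expected output shape: 2.0.0+cu118 11.8 8700 (illustrative)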

@VictorZakharov

@lijian12345 No effect in my case.
4070 Ti. 7 it/sec before and after.
512, 40 steps, fp16, no xformers.

@rodsott commented Apr 10, 2023

Lots of testing here today, trying to make my 4090 reach 100% GPU usage like it does in the System Info benchmark, where I have 40 it/s recorded. But using the regular latest "git pull" update, with torch 1.13.1+cu117 and xformers 0.0.16rc425, I get a max speed of 19 it/s at a capped 50% GPU usage. I tried both v11.8 and v12.1 of the CUDA Runtime, and the respective cuDNN files in the .\venv\Lib\site-packages\torch\lib folder.

These tests were done in txt2img at 512x512, 20 steps, batch size 10, with an empty prompt and the "v1-5-pruned-emaonly" checkpoint.

I tested with both the latest NVIDIA Studio and Game Ready drivers, v531.41.

I did a fresh install and updated it to torch 2.0.0+cu118 (which messed with torchvision, giving an error but still working), disabled xformers, and added "--opt-sdp-attention" to the launcher. With that I've reached 25 it/s at 60% GPU usage. Best scenario so far, but wasn't it supposed to behave the same as the benchmark across all the 1111 functionality, like img2img, txt2img, Deforum, etc.?

Oh, I also added the following lines at the beginning of .\modules\txt2img.py, but nothing changed either:

import torch
torch.backends.cudnn.benchmark = True  # let cuDNN auto-tune conv algorithms for fixed input sizes
torch.backends.cudnn.enabled = True  # ensure cuDNN is used at all (True by default)
torch.backends.cuda.matmul.allow_tf32 = True  # allow TF32 matmuls on Ampere and newer

Here are the results of the benchmark; using the Extra Steps and Extensive options, my 4090 reached 40 it/s:

[benchmark screenshot]

If anyone knows how to make auto1111 work at 100% CUDA usage, especially on the RTX 4090, please share a workaround here! Thanks in advance! =)

@morty6688 commented Apr 10, 2023

@lijian12345 No effect in my case. 4070 Ti. 7 it/sec before and after. 512, 40 steps, fp16, no xformers.

Without xformers I get 8-9 it/s; maybe you can check whether your GPU driver and torch version are updated properly.
Adding --xformers --opt-sub-quad-attention --opt-channelslast can noticeably boost the iteration speed.

@haldi4803 commented Apr 10, 2023

No xformers, cuDNN 8500
xformers, cuDNN 8500
xformers, cuDNN 8800
Torch 2.0 with --opt-sub-quad-attention --opt-channelslast added
[screenshot comparing the results of each configuration]

@GuruVirus

For Win 11 I got it working and wrote up this guide: https://rentry.org/31337
For Linux instructions see: https://rentry.org/installing-automatic1111

@VictorZakharov

@GuruVirus Followed the guide and got 20.69 it/sec in the 1200 test on my 4070 Ti.
Same numbers with batch size 1, so batch 8 test is not necessary.
Thanks!

@haldi4803 commented Apr 20, 2023

For Win 11 I got it working and wrote up this guide: https://rentry.org/31337 For Linux instructions see: https://rentry.org/installing-automatic1111

pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
(can skip if using --opt-sdp-no-mem-attention instead of --xformers)
If you get file path length error exit venv and run
git config --system core.longpaths true
If you get a generic compile failure, especially related to cmake, make sure you install C++ build tools: https://visualstudio.microsoft.com/visual-cpp-build-tools/

Doesn't work. Are there any special settings needed when installing the C++ Build Tools?

https://pastebin.com/SGY7eQxU

[screenshot of the failed build]

@VictorZakharov

@haldi4803 Use SDP instead. I had issues with xformers.

@GuruVirus commented Apr 21, 2023

I updated the guide to make it easier to skip xformers, which I recommend for simplicity.
If you want to continue with xformers: I noticed in your screenshot, in the right column, that the Win11 SDK is unchecked; it is checked by default.
[screenshot of the Visual Studio installer components]

@VictorZakharov

Yet another fix is to switch to using this repo:

aka Vlad1111. I had no issues getting 21 it/s on a 4070 Ti (512x512, 100 steps, Euler a) without xformers (but with SDP). It is also faster to start rendering: A1111 takes a few seconds, while Vlad1111 starts instantly, which can be a 50% improvement on low-step 512x512 gens, assuming high-end video cards.

No additional configs, no tweaking, it just worked. It failed on the first run of webui.bat, but the second time it worked with no issues. Inpainting supported, upscale supported, pretty much all the standard features that in A1111 might need additional setup. Finally some competition.


@keetruda69

I tried https://rentry.org/installing-automatic1111 and jumped from 4 it/s to 20 it/s, but wasn't able to get further. Then I installed https://github.com/vladmandic/automatic and it did the magic: I got right up to 40 it/s without any tweaks. I haven't had time to check why, but it works like a charm.

@VictorZakharov

@keetruda69 This is what happens when devs ignore quality-of-life updates for too long and only focus on features or low-hanging fruit.

@aifartist

I tried https://rentry.org/installing-automatic1111 and jumped from 4 it/s to 20 it/s, but wasn't able to get further. Then I installed https://github.com/vladmandic/automatic and it did the magic: I got right up to 40 it/s without any tweaks. I haven't had time to check why, but it works like a charm.

I recall seeing that it was installing PyTorch 2.0.1+cu117 instead of 2.0.1+cu118, which is where the fix is.
Vlado's fork gets this right.

@XeonG commented Jun 18, 2023

Vlad's fork looks awful (dunno why they don't stick to the original theme)... but that's not the issue. Every time I've tried reinstalling it fresh, it seems really slow at loading, with navigation barely working... I dunno, it just seems very broken, or I just keep trying it at the worst commit times... slow JS. Automatic seems to work, but performance with a 4090 is just plain garbage. The Python stuff just seems like a mess... why would anyone use such a broken, janky language for this stuff?

@VictorZakharov

@XeonG Open a ticket; he's pretty quick to resolve all real issues.

@XeonG commented Jun 19, 2023

Well, I would... but every time I've tried it, even fresh with a cleared browser cache and the same setup (models, VAE, etc.) as automatic, Vlad's one just seems to have trouble loading, with a generally sluggish UI. The last time it didn't even work at all when loading the model checkpoint... but there were no real errors to report, so I assumed I just tried it at a bad commit point, no idea. Maybe I will file a bug issue next time I try it.

@catboxanon added the asking-for-help-with-local-system-issues label and removed the question label Aug 3, 2023
Atry pushed a commit to Atry/stable-diffusion-webui that referenced this issue Jul 11, 2024