RTX 4090 performance #2449
What parameters are you using? You might want to try, say, a batch of 5+ images. Also, is xformers installed and enabled with --xformers? Are you using --opt-channelslast? Half precision? etc.
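For anyone unsure where those flags live: a minimal sketch of a webui-user.bat using the flags named in this thread (everything besides the flag names is boilerplate; adjust to your setup):

@echo off
REM Pass the attention/memory-layout optimizations discussed above to the webui
set COMMANDLINE_ARGS=--xformers --opt-channelslast
call webui.bat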
I'm having the same issue: my 4090 is generating at around the same speed as (or slower than) my 3090 used to on the same machine. I am using Windows 10, with --xformers as my only parameter. I will be setting up my 3090 in a different PC soon, so I will be able to provide some direct comparison it/s. My "benchmark" is just the prompt "chair" at 150 steps with all default settings (Euler a, 512x512, Scale 7, Clip Skip 1, ENSD 0). Using --xformers doesn't seem to make a difference; either way I'm getting around 10.6 it/s.
Xformers currently lacks support for Lovelace (in fact, PyTorch also lacks it, I believe). Your quoted 3090 numbers are too low, BTW; I get around 16 it/s with your settings on a 3080 12GB. I'll perform some further testing when my 4090 arrives (and attempt to build xformers for it).
Gotcha, guess my 4090 performance will be meh until PyTorch (and xformers) gets Lovelace support. Those numbers were for my 4090; my 3090 was not plugged in at the time, but it is now. With those same settings, my 3090 gets around 15.7 it/s without --xformers. I don't have xformers set up yet on that machine (I'm running Ubuntu and will need to use a workaround to get xformers installed properly). So the 4090 currently has only 2/3rds the performance of a non-xformers 3090.
I had hit 15-18 it/s with my 3090, but now it's 13.
Preliminary testing:
Updating cuDNN did the trick. Getting 15 it/s without xformers. To support Lovelace in xformers, we need a CUDA 11.8 build of PyTorch (I think).
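For reference, the cuDNN swap people mention here boils down to overwriting the DLLs PyTorch ships with. A rough sketch on Windows, assuming you've downloaded a newer cuDNN from NVIDIA (the source path is a placeholder; the destination is the torch\lib folder named later in this thread):

REM Run from the webui root; overwrite torch's bundled cuDNN DLLs with newer ones
copy /Y "C:\path\to\cudnn\bin\cudnn*.dll" ".\venv\Lib\site-packages\torch\lib\"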
Nope, managed to build it. Getting a 43% speed-up compared to my 3080: 23 it/s. With batch size 8, my 4090 is twice as fast compared to the 3080, ~40 it/s.
@ilcane87 @comp-nect @Kosinkadink Could you please test this wheel? This works on my 4090, but I need to make sure there isn't a regression (broken on Ampere or Pascal or whatever) with this build.
@C43H66N12O12S2
@ilcane87 Please try this one. Much thanks for testing these for me :)
@C43H66N12O12S2 I'd love to test this on my 4090 :D. Do I just activate venv and then pip install it while it sits in the webui root dir?
Yep
@C43H66N12O12S2 Oh, it also says same version, should've seen that coming... assuming I also just --force-reinstall?
pip seems to have replaced your torch with a CPU-only one. Reinstall the CUDA build.
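A plausible form of that reinstall, assuming the cu117 wheel index (the exact package list and versions are an assumption):

REM Inside the activated venv: force-reinstall a CUDA-enabled torch build
pip install --force-reinstall torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117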
On Ubuntu with xformers, my 3090 can get 20 it/s. It looks like the 4090 doesn't improve much; maybe we still need to wait for a driver update.
Still getting the same error on this one too.
@C43H66N12O12S2 Is there any way to ensure it's using the new wheel? Getting about 11 it/s on 512x512, 8 cfg, 40 steps, Euler a on NAI (not sure which of these affect performance, so noting most settings).
Well, it would error with an older wheel. You need to add --force-enable-xformers to your COMMANDLINE_ARGS, btw (for now).
Oh, interesting. It is in there, though, along with --no-half and --precision full for NAI. Not sure if the model is making the difference.
Can you test your performance without xformers? I get 23 it/s, and --no-half should be about half that speed, so 11 it/s sounds about right, actually.
Weird. On first load without forced xformers, it never got past "Commit hash: blabla". Closing the cmd and starting it again, though, produced a regular load-up time. Without xformers, same generation settings and cmd args, getting about 9.6 it/s. No batching.
Yeah, xformers is working for you. Use larger batches for bigger gains. Also consider removing --no-half and --precision full; IME FP16 and FP32 have only minute differences, but FP16 is twice the speed.
Was literally about to ask if I really need those, yeah, thanks for pointing it out! Wish more coders were as clearly spoken and thoughtful as you are, man. Two random questions while I have your attention :D
Yes. You're seeing a decrease because it's generating multiple images at once. To calculate what it'd correspond to for a single image, multiply the reported it/s by the batch size.
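Spelled out as a throwaway Python sketch (the numbers are illustrative, not from this thread):

# The progress bar reports iterations over the whole batch, so the
# single-image-equivalent rate is the reported it/s times the batch size.
reported_its = 5.0
batch_size = 8
effective_its = reported_its * batch_size
print(effective_its)  # 40.0 -- comparable to a batch-size-1 it/s figure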
If you are getting suboptimal performance on Windows even after upgrading PyTorch and replacing the cuDNN .dlls, check your Windows settings and see if "Hardware-accelerated GPU scheduling" is turned on. On Windows 11 you can find that setting under Settings > System > Display > Graphics > Change default graphics settings.
At this point, does anyone just have a reproducible "git clone, do x, do y, get great performance" for the 4090? So many conflicting instructions and partial steps in this thread by now.
I'm getting <2 it/s with my 4090...
@BasedAnon less than 2 it/s? Using what settings, exactly?
@leemmcc install Linux, replace the cuDNN libs, add the launch flags discussed above. And that's as much as I know atm.
@SamueleLorefice 512 x 904, Heun, Auto1111, everything else stock (Anything V3 model with VAE).
I updated to PyTorch 2.0 and wow, I'm getting 16+ it/s on 512x512.
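For anyone wanting to try the same, a minimal sketch of that upgrade inside the webui venv, assuming the cu118 wheel index (the package list is an assumption):

REM Inside the activated venv: move to the PyTorch 2.0 CUDA 11.8 build
pip install --upgrade torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118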
Can someone please do a guide for Windows on updating to PyTorch 2.0, which cuDNN libs to use, etc.?
@XeonG Would be better if someone introduced a config flag to A1111 to use torch 2.0.
From this post, we can use that approach. BTW, turning off HAGS seems to make no difference for me on Win11 22621.
@lijian12345 No effect in my case.
Lots of testing here today, trying to make my 4090 reach 100% GPU usage like it does in the System Info Benchmark, where I have 40 it/s marked. But using the regular latest "git pull" update, with torch: 1.13.1+cu117 and xformers: 0.0.16rc425, I get a max speed of 19 it/s at a capped 50% GPU usage. I tried both v11.8 and v12.1 of the CUDA Runtime, and the respective cuDNN files in the .\venv\Lib\site-packages\torch\lib folder. These tests were made in txt2img, 512x512, 20 steps, batch size 10, with an empty prompt, and checkpoint "v1-5-pruned-emaonly". I tested with both the NVIDIA Studio and Game latest drivers, v531.41.

I did a fresh install and updated it to torch: 2.0.0+cu118 (which messed with torchvision, giving an error but still working), disabled xformers, and added "--opt-sdp-attention" to the launcher. With that I've reached 25 it/s at 60% GPU usage. Best scenario so far, but wasn't it supposed to behave the same as in the benchmark across all the 1111 functionalities, like img2img, txt2img, Deforum, etc.?

Oh, I also added the following lines to the beginning of .\modules\txt2img.py, but nothing happened either:

import torch

Here are the results of the benchmark, using the Extra Steps and Extensive options: my 4090 reached 40 it/s.

If anyone knows how to make auto1111 work at 100% CUDA usage, especially for the RTX 4090, please share a workaround here! Thanks in advance! =)
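The lines after "import torch" were cut off above; a commonly circulated snippet of that shape enables cuDNN autotuning and TF32 (these exact flags are an assumption, not necessarily what the poster added):

import torch
# Let cuDNN benchmark conv algorithms for the fixed 512x512 workload
torch.backends.cudnn.benchmark = True
# Allow TF32 matmuls/convolutions on Ampere+ GPUs for extra throughput
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True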
Without xformers I got 8-9 it/s; maybe you can check whether your GPU driver and torch version are updated properly.
For Win 11 I got it working and wrote up this guide: https://rentry.org/31337
@GuruVirus Followed the guide and got 20.69 it/s in the 1200 test on my 4070 Ti.
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers

doesn't work. Any special settings one needs when installing C++ Build Tools?
@haldi4803 Use SDP instead. I had issues with xformers.
Yet another fix is to switch to using this repo (vladmandic/automatic), aka Vlad1111. I had no issues getting 21 it/s on a 4070 Ti (512x512, 100 steps, Euler a) without xformers (but with SDP). It is also faster to start rendering: A1111 takes a few seconds, Vlad1111 starts instantly, which can be a 50% improvement on low-step 512x512 gens, assuming high-end video cards. No additional configs, no tweaking; it just worked. It failed on the first run of webui.bat, but the second time worked with no issues. Inpainting supported, upscale supported, pretty much all the standard features in A1111 that you might need additional setup for. Finally some competition.
Yeah, I just heard about this (from the same YouTube video); I'm excited to try it out. One thing I didn't hear: are automatic1111 extensions supported, or do we have to wait for extension authors to update to be compatible? I'm thinking of things like dynamic prompts, or a specific depth map plugin I used to make 3D videos?
I tried https://rentry.org/installing-automatic1111 and jumped from 4 it/s to 20 it/s, but wasn't able to get further. Then I tried installing https://github.com/vladmandic/automatic and it did the magic: I got right to 40 it/s without any tweaks. Didn't have time to check why, but it works like a charm.
@keetruda69 This is what happens when devs ignore quality-of-life updates for too long, and only focus on features or low-hanging fruit.
I recall seeing that it was installing PyTorch 2.0.1+cu117 instead of 2.0.1+cu118, which is where the fix is.
Vlad's fork looks awful (dunno why they don't stick to the original theme)... but that's not the issue. Every time I've tried reinstalling it from fresh, it seems to be really slow at loading, with navigation barely working. I dunno, it just seems very broken, or I keep trying it at the worst commit times... slow JS. Automatic seems to work, but yeah, performance with a 4090 is just plain garbage. The Python stuff just seems like a mess; why would anyone use such a broken, janky language for this stuff?
@XeonG Open a ticket, he's pretty quick to resolve all real issues.
Well, I would... but every time I've tried it from fresh, even clearing the browser cache, with the same setup I have in Automatic (models, VAE, etc.), Vlad's one just seems to have trouble loading, and the UI is generally sluggish. The last time it didn't even work at all when loading the model checkpoint... but there are no real errors to report, so I assumed I just tried it at a bad commit point, no idea. Maybe I will file a bug issue next time I try it.
Is the 4090 fully supported in SD?
I am getting the same performance with the 4090 that my 3070 was getting.