Quadro P6000 #8
-
The 8-bit path is slower and a balancing act between OOM and NaN errors. Our cards don't have the particular flavor of hardware int8 matrix multiplication that newer cards do, so it uses a workaround that performs like you're seeing. GPTQ, the way I have it implemented here, is fairly fast. Stock ooba doesn't support some of the other models and moved to GPTQv2, so I guess it will depend on whether you have a lot of old v1 models or only v2 models, and whether you want GPT-NeoX support. I have not tried Windows yet, and I assume that to use the autograd implementation you would have to compile a CUDA kernel from https://github.com/Ph0rk0z/GPTQ-Merged. If that builds there, it should be fairly easy to compare. It's about to get interesting, because we are locked out of Triton and the newer CUDA implementation is a third as fast.
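For context (my summary, not something stated above): NVIDIA added int8 tensor-core matmul with Turing (compute capability 7.5), while Pascal cards like the Quadro P6000 are compute capability 6.1, which is why 8-bit inference falls back to a slower emulated path there. A minimal sketch of that check, with `has_int8_tensor_cores` being a hypothetical helper and the 7.5 cutoff an assumption on my part:

```python
# Hypothetical helper: does this CUDA compute capability include
# int8 tensor-core matmul? (Assumed cutoff: Turing, sm_75.)
def has_int8_tensor_cores(major: int, minor: int) -> bool:
    """True if this compute capability has int8 tensor-core MMA support."""
    return (major, minor) >= (7, 5)

# Quadro P6000 (Pascal) is compute capability 6.1 -> no int8 tensor cores,
# so bitsandbytes-style 8-bit matmul has to use a slower workaround.
print(has_int8_tensor_cores(6, 1))   # -> False (Pascal P6000)
print(has_int8_tensor_cores(7, 5))   # -> True  (Turing and newer)
```

On real hardware you would read the capability from `torch.cuda.get_device_capability()` rather than hard-coding it.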
-
I see that you have the same card as I do...
The 4-bit stuff runs at approximately 5.13 tokens/s.
The 8-bit stuff runs at approximately 1.13 tokens/s.
I am using Windows and have installed regular Oobabooga both in WSL and Windows "native"...
Should I try your code?
Does the 8-bit path just have terrible performance because bitsandbytes and GPTQ only fully support newer GPUs?
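For what it's worth, the gap between those two rates works out to roughly a 4.5x slowdown for 8-bit (my arithmetic, using the numbers quoted above):

```python
# Throughput ratio from the rates quoted above (tokens/s).
rate_4bit = 5.13
rate_8bit = 1.13

slowdown = rate_4bit / rate_8bit
print(f"8-bit is {slowdown:.1f}x slower than 4-bit")  # -> 8-bit is 4.5x slower than 4-bit
```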