Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset #14304
Conversation
Have you been able to run the benchmarks?
Unfortunately not, I've been unable to get the quantized vector datasets working on my machine.
I compared this branch with main. There are measurable improvements, but the quantization step isn't the main bottleneck; vector comparisons still dominate the cost. Still, it's a nice bump I would say.
candidate:
baseline:
12.5% faster search overall if I read correctly? This is pretty cool! We've been excited about smaller speedups many times in Lucene's history. :)
Hmm, maybe I got confused: quantization only needs to be applied to the query vector at query time, so the search speedup is noise and I should rather be looking at the indexing speedup (+2%) and merging speedup (+5%)?
That is what I think. We can run a bunch more times, but I do think this provides a marginal improvement at indexing time, where we may actually re-quantize all the vectors. I bet that for "flat" indices, which only use quantization, the speedup is significant, though I haven't had time to benchmark that yet.
```java
v.sub(minQuantile / 2f)
    .mul(minQuantile)
    .add(v.sub(minQuantile).sub(dxq).mul(dxq))
    .reduceLanes(VectorOperators.ADD);
```
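(For reference, each lane of this chain evaluates the scalar corrective-offset formula `minQuantile * (v - minQuantile / 2) + ((v - minQuantile) - dxq) * dxq`, where `dxq` is the dequantized delta, and the final `reduceLanes` sums it across lanes.)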
Could you collect the corrections in a float array? This way we keep all lanes parallelized and then sum the floats later?
I think if you could keep the lanes separate for as long as possible, we get a bigger perf boost. Reducing lanes is a serious bottleneck.
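A minimal sketch of that pattern, assuming hypothetical names (`QuantizeSketch`, `correctiveOffset`, `minQuantile`, `maxQuantile`, `scale`) and eliding the round-to-grid step the real kernel performs; needs `--add-modules jdk.incubator.vector`:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class QuantizeSketch {
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  static float correctiveOffset(float[] vec, float minQuantile, float maxQuantile, float scale) {
    // Per-lane accumulator: all lanes stay parallel across the whole loop.
    FloatVector corrections = FloatVector.zero(SPECIES);
    int i = 0;
    for (; i < SPECIES.loopBound(vec.length); i += SPECIES.length()) {
      FloatVector v = FloatVector.fromArray(SPECIES, vec, i);
      // Clamp to [minQuantile, maxQuantile] and shift, as in the scalar code.
      FloatVector dxc = v.min(maxQuantile).max(minQuantile).sub(minQuantile);
      // Dequantized delta (the real kernel also rounds to the quantization grid).
      FloatVector dxq = dxc.mul(scale);
      FloatVector corr =
          v.sub(minQuantile / 2f)
              .mul(minQuantile)
              .add(v.sub(minQuantile).sub(dxq).mul(dxq));
      corrections = corrections.add(corr);
    }
    // One horizontal reduction after the loop instead of one per chunk.
    float correction = corrections.reduceLanes(VectorOperators.ADD);
    // Scalar tail for the remaining elements.
    for (; i < vec.length; i++) {
      float dx = vec[i] - minQuantile;
      float dxc = Math.max(minQuantile, Math.min(maxQuantile, vec[i])) - minQuantile;
      float dxq = dxc * scale;
      correction += minQuantile * (vec[i] - minQuantile / 2f) + (dx - dxq) * dxq;
    }
    return correction;
  }
}
```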
Indeed it is - this doubles the performance:

```
Benchmark             Mode  Cnt      Score    Error   Units
Quant.quantize        thrpt    5   235.029 ±  3.204  ops/ms
Quant.quantizeVector  thrpt    5  2831.313 ± 46.475  ops/ms
```
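For context, a minimal JMH harness of the shape that could produce numbers like these; the class and method names mirror the output above, but the vector length, seed, and quantile/scale parameters are illustrative, and `quantizeVector` here reuses the `QuantizeSketch` from the previous comment:

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
@State(Scope.Benchmark)
public class Quant {
  private float[] vec;

  @Setup
  public void setup() {
    vec = new float[1024];
    Random random = new Random(42);
    for (int i = 0; i < vec.length; i++) {
      vec[i] = random.nextFloat();
    }
  }

  @Benchmark
  public float quantizeVector() {
    // Illustrative parameters; the scalar counterpart would be benchmarked the same way.
    return QuantizeSketch.correctiveOffset(vec, 0.1f, 0.9f, 0.5f);
  }
}
```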
And even more with FMA operations
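For illustration, the multiply-adds in the corrective chain can fold into `FloatVector.fma` (names as in the earlier sketch); note that `fma` can fall back to a slow strict implementation on CPUs without hardware FMA, so real code typically gates its use:

```java
// (dx - dxq) * dxq + minQuantile * (v - minQuantile / 2), with one fused multiply-add
FloatVector dx = v.sub(minQuantile);
FloatVector corr = dx.sub(dxq).fma(dxq, v.sub(minQuantile / 2f).mul(minQuantile));
corrections = corrections.add(corr);
```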
Oh yes ;). Those are the numbers I am expecting.
```java
@@ -907,4 +907,87 @@ public static long int4BitDotProduct128(byte[] q, byte[] d) {
  }
  return subRet0 + (subRet1 << 1) + (subRet2 << 2) + (subRet3 << 3);
}

@Override
public float quantize(
```
Let's name this something better; we could call it "minMaxScalarQuantization" or something?
Done - and the recalculate method too
On GCP, there isn't much difference. I wouldn't expect a huge difference, as the dominant cost is the vector comparisons, not the quantization. I haven't tested with "flat" yet.
BASELINE
CANDIDATE
OK, sorry for the random and sparse feedback. I think this is almost there. Then I can take over merging and backporting.
Please add a CHANGES entry for 10.2 optimizations indicating a minor speed improvement for scalar quantized query & indexing speed.
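For reference, a hypothetical entry of the usual CHANGES.txt shape (exact wording and attribution up to the author):

```
Optimizations
---------------------

* GITHUB#14304: Minor speedup for scalar quantized query and indexing: added
  vectorized implementations of ScalarQuantizer.quantize and
  recalculateCorrectiveOffset. (PR author)
```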
lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java (resolved)
lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultVectorUtilSupport.java (resolved)
Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset (#14304)

This resolves #13922. It takes the existing methods in `ScalarQuantizer` and creates vectorized versions of the same algorithm. JMH shows a ~13x speedup:

```
Benchmark             Mode  Cnt      Score     Error   Units
Quant.quantize        thrpt    5   235.029 ±   3.204  ops/ms
Quant.quantizeVector  thrpt    5  3153.388 ± 192.635  ops/ms
```