
Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset #14304


Merged
14 commits merged into apache:main on Mar 25, 2025

Conversation

thecoop
Contributor

@thecoop thecoop commented Feb 27, 2025

This resolves #13922. It takes the existing methods in ScalarQuantizer, and creates vectorized versions of that same algorithm.

JMH shows a ~13x speedup:

```
Benchmark              Mode  Cnt     Score    Error   Units
Quant.quantize        thrpt    5   235.029 ±  3.204  ops/ms
Quant.quantizeVector  thrpt    5  3153.388 ± 192.635  ops/ms
```
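
For readers unfamiliar with the Panama Vector API, the sketch below shows roughly what a vectorized min-max scalar quantization loop looks like. It is illustrative only, not the code in this PR: the class, method, and parameter names are hypothetical, the scalar tail loop is omitted, and it needs `--add-modules jdk.incubator.vector` to compile. The corrective-offset expression mirrors the form discussed in the review comments below.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class QuantizeSketch {
  static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  /**
   * Quantizes src into dest and returns the corrective offset, processing
   * SPECIES.length() floats per iteration. Tail elements are left to a scalar
   * loop, omitted here for brevity.
   */
  static float minMaxQuantize(
      float[] src, byte[] dest, float minQuantile, float maxQuantile, float scale, float alpha) {
    float correction = 0f;
    for (int i = 0; i < SPECIES.loopBound(src.length); i += SPECIES.length()) {
      FloatVector v = FloatVector.fromArray(SPECIES, src, i);
      // clamp to [minQuantile, maxQuantile] and shift so the range starts at zero
      FloatVector dxc = v.min(maxQuantile).max(minQuantile).sub(minQuantile);
      // scale into the quantized range
      FloatVector dxs = dxc.mul(scale);
      // write out the quantized bytes (lane-by-lane here purely for simplicity)
      for (int j = 0; j < SPECIES.length(); j++) {
        dest[i + j] = (byte) Math.round(dxs.lane(j));
      }
      // per-lane corrective offsets, reduced into a running scalar total
      FloatVector dxq = dxs.mul(alpha);
      correction +=
          v.sub(minQuantile / 2f)
              .mul(minQuantile)
              .add(v.sub(minQuantile).sub(dxq).mul(dxq))
              .reduceLanes(VectorOperators.ADD);
    }
    return correction;
  }
}
```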

@thecoop thecoop marked this pull request as ready for review February 27, 2025 15:29
@jpountz
Contributor

jpountz commented Feb 27, 2025

Have you been able to run luceneutil to get a sense of the indexing and search speedups?

@thecoop
Contributor Author

thecoop commented Feb 28, 2025

Unfortunately not, I've been unable to get the quantized vector datasets working on my machine

@benwtrent
Member

I compared this branch with main. There are measurable improvements, but the quantization step isn't the main bottleneck; vector comparisons still dominate the cost. But it's a nice bump, I would say.

candidate:

```
recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.826         2.340  500000   100      50       32        100     7 bits    86.54       5777.61         337.47             1          1859.34       1831.055       366.211
```

baseline:

```
recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.828         2.680  500000   100      50       32        100     7 bits    88.48       5650.74         357.45             1          1859.57       1831.055       366.211
```

@jpountz
Contributor

jpountz commented Mar 7, 2025

12.5% faster search overall if I read correctly? This is pretty cool! We've been excited about smaller speedups many times in Lucene's history. :)

@jpountz
Contributor

jpountz commented Mar 7, 2025

Hmm maybe I got confused, as quantization only needs to be applied to the query vector at query time, so the search speedup is noise and I should rather be looking at the indexing speedup (+2%) and merging speedup (+5%)?

@benwtrent
Member

> as quantization only needs to be applied to the query vector at query time, so the search speedup is noise and I should rather be looking at the indexing speedup (+2%) and merging speedup (+5%)?

That is what I think. We can run a bunch more times, but I do think this provides a marginal improvement at indexing time, where we may actually re-quantize all the vectors.

I bet that for "flat" indices, which only use the quantization, the speedup is significant. Though I haven't had time to benchmark that yet.

```java
v.sub(minQuantile / 2f)
    .mul(minQuantile)
    .add(v.sub(minQuantile).sub(dxq).mul(dxq))
    .reduceLanes(VectorOperators.ADD);
```
Member

Could you collect the corrections in a float array? This way we keep all lanes parallelized and then sum the floats later?

Member

I think if you could keep the lanes separate for as long as possible, we get a bigger perf boost. Reducing lanes is a serious bottleneck.
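
Roughly, the suggestion amounts to something like the following (an illustrative sketch, not the PR's actual code, reusing the hypothetical SPECIES and parameter names from the sketch in the description above): keep the per-lane corrections in a vector accumulator and perform a single reduceLanes after the loop.

```java
// Illustrative sketch of the suggestion: accumulate the per-lane corrections
// in a vector and reduce to a scalar once, after the loop.
static float correctiveOffsetSketch(
    float[] src, float minQuantile, float maxQuantile, float scale, float alpha) {
  FloatVector acc = FloatVector.zero(SPECIES);
  for (int i = 0; i < SPECIES.loopBound(src.length); i += SPECIES.length()) {
    FloatVector v = FloatVector.fromArray(SPECIES, src, i);
    FloatVector dxc = v.min(maxQuantile).max(minQuantile).sub(minQuantile);
    FloatVector dxq = dxc.mul(scale).mul(alpha);
    // corrections stay in their lanes; no horizontal reduction inside the loop
    acc = acc.add(
        v.sub(minQuantile / 2f)
            .mul(minQuantile)
            .add(v.sub(minQuantile).sub(dxq).mul(dxq)));
  }
  // single reduction at the very end (scalar tail loop omitted)
  return acc.reduceLanes(VectorOperators.ADD);
}
```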

Contributor Author

@thecoop thecoop Mar 10, 2025

Indeed it is - this doubles the performance

```
Benchmark              Mode  Cnt     Score    Error   Units
Quant.quantize        thrpt    5   235.029 ±  3.204  ops/ms
Quant.quantizeVector  thrpt    5  2831.313 ± 46.475  ops/ms
```

Contributor Author

And even more with FMA operations
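
For illustration (again a hedged sketch with hypothetical names, not the code in this PR), the per-lane correction from the snippet above can be expressed with fused multiply-add, where a.fma(b, c) computes a * b + c as a single lanewise operation on hardware with FMA support. The accumulator from the previous sketch would then just add the result of this helper each iteration.

```java
// Hypothetical helper showing the FMA form of the per-lane correction:
// minQuantile * (v - minQuantile/2) + (v - minQuantile - dxq) * dxq
static FloatVector laneCorrections(FloatVector v, FloatVector dxq, float minQuantile) {
  FloatVector minQ = FloatVector.broadcast(SPECIES, minQuantile);
  return v.sub(minQuantile / 2f).fma(minQ, v.sub(minQuantile).sub(dxq).mul(dxq));
}
```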

Member

Oh yes ;). Those are the numbers I am expecting.

```
@@ -907,4 +907,87 @@ public static long int4BitDotProduct128(byte[] q, byte[] d) {
      }
      return subRet0 + (subRet1 << 1) + (subRet2 << 2) + (subRet3 << 3);
    }

  @Override
  public float quantize(
```
Member

Let's name this something better, we can call it "minMaxScalarQuantization" or something?

Contributor Author

Done - and the recalculate method too

@benwtrent
Member

Ugh, my benchmark was on my laptop, which I think counts as "not having nice byte vectors". I will attempt to benchmark correctly on a cloud machine soon-ish.

Sorry @jpountz @thecoop for the incorrect benchmark numbers :)

@benwtrent
Member

On GCP, there isn't much difference. I wouldn't expect a huge difference, as the dominant cost is the vector comparisons, not the quantization.

I haven't tested with "flat" yet.

BASELINE

```
recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.961         2.910  200000   100      50       64        250     7 bits     6677   111.44       1794.70          79.03             1           997.58        976.563       195.313
```

CANDIDATE

```
recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.960         2.460  200000   100      50       64        250     7 bits     6527   110.99       1801.98          76.68             1           997.55        976.563       195.313
```

Member

@benwtrent benwtrent left a comment

OK, sorry for the random and sparse feedback. I think this is almost there. Then I can take over merging and backporting.

Please add a CHANGES entry for 10.2 optimizations indicating a minor speed improvement for scalar quantized query & indexing speed.

@benwtrent benwtrent merged commit 1e8a146 into apache:main Mar 25, 2025
7 checks passed
benwtrent pushed a commit that referenced this pull request Mar 25, 2025
Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset (#14304)


Successfully merging this pull request may close these issues.

Can we use Panama Vector API for quantizing vectors?