Speed up exhaustive evaluation. #14679
Conversation
This change helps speed up exhaustive evaluation of term queries, i.e. calling `DocIdSetIterator#nextDoc()` then `Scorer#score()` in a loop. It helps in two ways:

- Iteration of matching doc IDs gets a bit more efficient, especially in the case when a block of postings is encoded as a bit set.
- Computation of scores now gets (auto-)vectorized.

While this change doesn't help much when dynamic pruning kicks in, I'm hopeful that we can improve this in the future.
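For context, the access pattern being optimized can be sketched like this. The interfaces below are simplified stand-ins for Lucene's `DocIdSetIterator` and `Scorer`, not the real API; scoring is reduced to an array lookup for illustration:

```java
import java.util.Arrays;

public class ExhaustiveEval {
    // Sentinel mirroring DocIdSetIterator.NO_MORE_DOCS.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Simplified stand-in for Lucene's DocIdSetIterator.
    interface DocIdSetIterator {
        int nextDoc();
    }

    // Exhaustive evaluation: visit every matching doc and score it.
    // The scalar loop pays one virtual call per doc for nextDoc() plus one
    // for scoring, which is what batching doc IDs and scores helps avoid.
    static float sumScores(DocIdSetIterator it, float[] scoreByDoc) {
        float total = 0;
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            total += scoreByDoc[doc];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] docs = {0, 2, 5};
        int[] pos = {-1};
        DocIdSetIterator it = () -> ++pos[0] < docs.length ? docs[pos[0]] : NO_MORE_DOCS;
        float[] scores = new float[6];
        Arrays.fill(scores, 1f);
        System.out.println(sumScores(it, scores)); // prints 3.0
    }
}
```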
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
Exhaustive evaluation (totalHitsThreshold=Integer.MAX_VALUE):
When dynamic pruning is enabled (Lucene's defaults):
The speed up is very exciting! I did a rough pass and left some minor suggestions/questions.
So this optimization can typically help cases like TOPN_COUNT, which needs to evaluate all docs, especially for indices with deleted docs, where `count` cannot return in constant time!
```java
/** Grow both arrays to ensure that they can store at least the given number of entries. */
public void grow(int minSize) {
  if (docs.length < minSize) {
    docs = ArrayUtil.grow(docs, minSize);
```
Maybe we typically need `growNoCopy` instead of `grow`?
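For reference, the distinction the reviewer is pointing at can be sketched as follows. These helpers mimic the semantics of Lucene's `ArrayUtil.grow`/`ArrayUtil.growNoCopy` but are stand-ins, not the real implementation:

```java
import java.util.Arrays;

public class GrowDemo {
    // Grows and preserves existing contents, like ArrayUtil.grow.
    static int[] grow(int[] array, int minSize) {
        return array.length >= minSize
                ? array
                : Arrays.copyOf(array, Math.max(minSize, array.length * 2));
    }

    // Reallocates without copying, like ArrayUtil.growNoCopy: cheaper when
    // the caller is about to overwrite the whole buffer anyway, which is the
    // case for these reusable doc/freq buffers.
    static int[] growNoCopy(int[] array, int minSize) {
        return array.length >= minSize
                ? array
                : new int[Math.max(minSize, array.length * 2)];
    }
}
```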
```java
 *
 * <p><b>NOTE</b>: The returned {@link DocAndFreqBuffer} should not hold references to internal
 * data structures.
 *
```
Should we clarify that this must not be called when the iterator is unpositioned?
```java
  size2 = enumerateSetBits(docBitSet.getBits()[i], i << 6, reuse.docs, size2);
}
assert size2 >= size : size2 + " < " + size;
for (int i = 0; i < size; ++i) {
```
Though this loop might get vectorized, would it be faster to just add the base in `enumerateSetBits`? Because these words typically have dense 1 bits:

```java
enumerateSetBits(docBitSet.getBits()[i], (i << 6) + docBitSetBase, reuse.docs, size2)
```
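To make the suggestion concrete, here is a sketch of what such a bit-enumeration helper does: it walks the set bits of one 64-bit word of the bit set and appends `base + bitIndex` to the doc-ID array. The signature follows the snippet quoted above; the actual Lucene implementation may differ:

```java
public class SetBits {
    // Append the positions of all set bits in `word`, offset by `base`,
    // to `docs` starting at `size`; return the new size. Folding the block
    // base into `base` up front (as suggested) saves a per-doc addition in
    // the caller's follow-up loop.
    static int enumerateSetBits(long word, int base, int[] docs, int size) {
        while (word != 0L) {
            docs[size++] = base + Long.numberOfTrailingZeros(word);
            word &= word - 1; // clear the lowest set bit
        }
        return size;
    }
}
```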
```java
/** Grow both arrays to ensure that they can store at least the given number of entries. */
public void grow(int minSize) {
  if (docs.length < minSize) {
    docs = ArrayUtil.grow(docs, minSize);
```
Same here, `growNoCopy` might be better?
```java
 * <p><b>NOTE</b>: The returned {@link DocAndScoreBuffer} should not hold references to internal
 * data structures.
 *
 * <p><b>NOTE</b>: In case this {@link Scorer} exposes a {@link #twoPhaseIterator()
```
When only the DISI is exposed, should it be positioned as well?
Yes indeed, will clarify.
```java
 *   return reuse;
 * </pre>
 *
 * <p><b>NOTE</b>: The returned {@link DocAndFreqBuffer} should not hold references to internal
```
Can we make the buffer arrays private and only expose getters and grow?
There are a couple of places where I call `System#arraycopy` directly on these arrays; let me think more about it.
I ended up not applying this suggestion, as the API calls would have looked awkward. I hope this is ok.
Thanks for trying! Let's keep it then.
```java
 * <p>This method behaves as if implemented as below, which is the default implementation:
 *
 * <pre class="prettyprint">
 * int batchSize = 16;
```
When I only read this javadoc without looking at impls, I was thinking impls should limit their block size to under 16 as well :) Maybe clarify that the max size of the buffer depends on the data structures.
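The batching pattern the javadoc describes can be sketched as below. `nextDoc` is a hypothetical stand-in for `DocIdSetIterator#nextDoc()`; as discussed above, the 16-doc batch is a detail of the default implementation, not a cap other implementations must respect:

```java
import java.util.function.IntSupplier;

public class BatchDemo {
    // Sentinel mirroring DocIdSetIterator.NO_MORE_DOCS.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Buffer up to batchSize matching docs into `docs` and return how many
    // were buffered. A real implementation backed by a postings block could
    // legally return a larger batch per call.
    static int nextDocsBatch(IntSupplier nextDoc, int[] docs, int batchSize) {
        int size = 0;
        while (size < batchSize) {
            int doc = nextDoc.getAsInt();
            if (doc == NO_MORE_DOCS) {
                break;
            }
            docs[size++] = doc;
        }
        return size;
    }
}
```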
```java
}

int size = docAndFreqBuffer.size;
normValues = ArrayUtil.grow(normValues, size);
```
growNoCopy?
```java
 */
public void score(int size, int[] freqs, long[] norms, float[] scores) {
  for (int i = 0; i < size; ++i) {
    scores[i] = score(freqs[i], norms[i]);
```
> Computation of scores now gets (auto-)vectorized.

By this, do you mean this method can get vectorized? So the abstraction layer does not prevent inlining here?
Auto-vectorization indeed requires `score(float, long)` to get inlined, which would only happen if there are at most two impls of `SimScorer` being used. We may need to implement `score(int, int[], long[], float[])` on our main similarities in the future to make performance more predictable. We may also be able to do a bit better than calling `score` in a loop. I was trying to keep the change small.
> We may also be able to do a bit better than calling score in a loop

Yeah! I played with `BM25` a bit and the result looks promising:

```
Benchmark                               Mode  Cnt   Score    Error  Units
VectorizedBM25Benchmark.scoreBaseline  thrpt    5  10.991 ±  0.356  ops/us
VectorizedBM25Benchmark.scoreVector    thrpt    5  15.149 ±  0.029  ops/us
```
```java
public static void scoreBaseline(
    int size, int[] freqs, long[] norms, float[] scores, float[] cache, int weight, float[] buffer) {
  for (int i = 0; i < size; ++i) {
    float normInverse = cache[((byte) norms[i]) & 0xFF];
    scores[i] = weight - weight / (1f + freqs[i] * normInverse);
  }
}

public static void scoreVector(
    int size, int[] freqs, long[] norms, float[] scores, float[] cache, int weight, float[] buffer) {
  for (int i = 0; i < size; ++i) {
    buffer[i] = cache[((byte) norms[i]) & 0xFF];
  }
  for (int i = 0; i < size; ++i) {
    scores[i] = weight - weight / (1f + freqs[i] * buffer[i]);
  }
}
```
Exciting!
Right. It should also help sparse neural search, where weights are less predictable, dynamic pruning works less well, and hits are effectively evaluated exhaustively in practice. Finally, I'm hoping that we can iterate on this change to also speed up top-n evaluation in the future.
Do we really need the method on Similarity? I guess I feel most users are probably using BM25Similarity, so I don't understand the explanation in the comments. If we have "bogus" instances (such as wrappers) of Similarity in use, then that's a Java problem; let's fix that instead.
You are correct, there is no need for additional APIs on Similarity at this point, so I removed it. I suspect it may be tempting in the future, because it enables further optimizations as @gf2121 showed in #14679 (comment) (though let's see if it actually translates to speedups with luceneutil). I also cleaned up the change; it's now ready for review.
Calls to `DocIdSetIterator#nextDoc`, `DocIdSetIterator#advance` and `SimScorer#score` are currently interleaved and include lots of conditionals. This builds up on apache#14679 and refactors the code a bit to make it eligible to auto-vectorization and better pipelining. This effectively speeds up conjunctive queries (e.g. `AndHighHigh`) but also disjunctive queries that run as conjunctive queries in practice (e.g. `OrHighHigh`).
Thank you! "bulkpostings 2.0" is looking really clean and non-invasive :)
Yes, thank you, I agree 100% to investigate it as a follow-up: the additional speedup hinted at there seems promising. If we can proceed with caution there, it would help. For similarities in particular, getting the formula correct can be difficult, and if you have to implement it twice, I have some concerns around correctness. At the very minimum we'd want to improve BaseSimilarityTestCase... For the PostingsEnum/Scorer changes I have similar concerns about correctness; I think what's happening in Asserting is not enough to guarantee correctness? E.g. for the PostingsEnum one I would think about CheckIndex itself validating the new bulk API, BasePostingsFormatTestCase additions, and also TestDuelingCodecs.
Fantastic job!
```java
freq();

int start = docBufferUpto - 1;
buffer.size = 0;
```
Nit: `buffer.size` has been set to 0 above (line 1047); can we avoid this one?
```java
int batchSize = 16; // arbitrary
buffer.growNoCopy(batchSize);
int size = 0;
DocIdSetIterator iterator = iterator();
```
We have many implementations returning a new iterator here (like `TwoPhaseIterator.asDocIdSetIterator`); will the object construction for every 16 docs cause noticeable overhead?
Possibly indeed. Let's look into it as a follow-up? I'm not sure if we should cache the iterator here or rather fix impls to avoid allocating in `#iterator()`.
Let's look into it as a follow-up
+1
```java
int size = docAndFreqBuffer.size;
normValues = ArrayUtil.growNoCopy(normValues, size);
if (norms == null) {
  Arrays.fill(normValues, 0, size, 1L);
```
Can we only do this fill when grow happens?
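The suggestion amounts to something like the following hypothetical rearrangement (not the actual patch): since the buffer is reused across calls and only ever holds the constant `1L` in the missing-norms case, the fill only needs to happen when the array is actually reallocated:

```java
import java.util.Arrays;

public class NormsFill {
    static long[] normValues = new long[0];

    // Only refill the constant 1L norms when the buffer was reallocated;
    // otherwise the previous contents are already all 1L. Mimics the
    // growNoCopy-then-fill pattern discussed above with stand-in code.
    static void ensureNorms(int size) {
        if (normValues.length < size) {
            normValues = new long[size]; // growNoCopy-style reallocation
            Arrays.fill(normValues, 1L); // fill once per reallocation
        }
    }
}
```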
Thanks for the feedback, both. I added coverage to
```java
    Arrays.fill(normValues, 1L);
  }
}
normValues = ArrayUtil.growNoCopy(normValues, size);
```
This line can be removed?
CheckIndex integration is pushed; I hooked into a place where we were already exhaustively consuming the
Existing vectorization of scores is a bit fragile since it relies on `SimScorer#score` being inlined in the for loops where it is called. This is currently the case in nightly benchmarks, but may not be the case in the real world where more implementations of `SimScorer` may be used, in particular those from `FeatureField`. Furthermore, existing vectorization has some room for improvement as @gf2121 highlighted at apache#14679 (comment).
Nightly benchmarks reported a ~6% speedup on the
* main: (32 commits)
  - update os.makedirs with pathlib mkdir (apache#14710)
  - Optimize AbstractKnnVectorQuery#createBitSet with intoBitset (apache#14674)
  - Implement #docIDRunEnd() on PostingsEnum. (apache#14693)
  - Speed up TermQuery (apache#14709)
  - Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. (apache#14701)
  - Fix WindowsFS test failure seen on Policeman Jenkins (apache#14706)
  - Use a temporary repository location to download certain ecj versions ("drops") (apache#14703)
  - Add assumption to ignore occasional test failures due to disconnected graphs (apache#14696)
  - Return MatchNoDocsQuery when IndexOrDocValuesQuery::rewrite does not match (apache#14700)
  - Minor access modifier adjustment to a couple of lucene90 backward compat types (apache#14695)
  - Speed up exhaustive evaluation. (apache#14679)
  - Specify and test that IOContext is immutable (apache#14686)
  - deps(java): bump org.gradle.toolchains.foojay-resolver-convention (apache#14691)
  - deps(java): bump org.eclipse.jgit:org.eclipse.jgit (apache#14692)
  - Clean up how the test framework creates asserting scorables. (apache#14452)
  - Make competitive iterators more robust. (apache#14532)
  - Remove DISIDocIdStream. (apache#14550)
  - Implement AssertingPostingsEnum#intoBitSet. (apache#14675)
  - Fix patience knn queries to work with seeded knn queries (apache#14688)
  - Added toString() method to BytesRefBuilder (apache#14676)
  - ...