-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Make Lucene better at skipping long runs of matches. #14312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is an attempt to resurrect apache#12194 in a (hopefully) better way. Now that many queries run with `DenseConjunctionBulkScorer`, which scores windows of doc IDs at a time, it becomes natural to skip clauses that have long runs of matches by checking if they match the whole window. This introduces the same `DocIdSetIterator#peekNextNonMatchingDocID()` API that PR apache#12194 suggested, implements it in `DocIdSetIterator#all`, and uses it in `DenseConjunctionBulkScorer` to skip clauses that match the whole window. For better test coverage, `DenseConjunctionBulkScorer` was refactored to require at least one iterator, which can be a `DocIdSetIterator#all` instance if all docs match. In follow-ups, we should look into supporting other queries that are likely to have long runs of matches, in particular doc-value range queries on fields that are part of the index sort and take advantage of a doc-value skipper. Closes apache#11915
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jpountz , we have nextDoc
for sparse docs, intoBitset
for dense docs, and now we are getting this new peekNextNonMatchingDocID
for sequential docs. It is exciting to see DocIdSetIterator
getting smarter on various doc distributions!
} | ||
} | ||
// Note: iterators may be empty! | ||
iterators = windowIterators; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lead clause might get changed in this window, iterators.get(0)
need to advance windowBase
again. I believe that is why CI tests get failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh right, thanks for catching. I did a large refactoring but I made sure to fix this, and added tests to DenseConjunctionBulkScorer that should catch such problems in the future.
windowIterators.clear(); | ||
for (DocIdSetIterator iterator : iterators) { | ||
// Skip iterators that fully match the window | ||
if (iterator.docID() > windowBase || iterator.peekNextNonMatchingDocID() < windowMax) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth passing the comparison value in, like iterator.peekNextNonMatchingDocID(windowMax)
, so that Implementations can reduce the number the blocks it need to check according to the threshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit on the fence about it because it makes testing harder (I like it being declarative, like we do for e.g. impacts or positions), and I don't expect this peekNextNonMatchingDocID
call to ever be a bottleneck?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't expect this peekNextNonMatchingDocID call to ever be a bottleneck
That makes sense, thanks for explanation!
I've been thinking a bit more about naming since I don't like peekNextNonMatchingDocID much, I'm thinking of renaming to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
} | ||
} | ||
|
||
if (minDocIDRunEnd >= bitsetWindowMax) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Immature idea:
As we computed this minDocIDRunEnd
anyway, can we just collect this range and update min to it? Probably like:
if (minDocIDRunEnd > min + 1) {
rangeDocIdStream.from = min;
rangeDocIdStream.to = minDocIDRunEnd;
collector.collect(rangeDocIdStream);
min = minDocIDRunEnd;
if (minDocIDRunEnd >= bitsetWindowMax) {
// We have a large range of doc IDs that all match.
return minDocIDRunEnd;
}
}
I wouldn't expect this to help much since it only happens at the edge of the range, but it also doesn't seem to have side effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to avoid collecting tiny windows of doc IDs at once, so that collectors can feel free to apply logic that has some overhead in LeafCollector#collect(DocIdStream)
(e.g. https://github.com/apache/lucene/pull/14273/files#diff-05525bb5769d4251279bcd9c76d259f8eb451a16075af357e69bd98890c3db5bR257). But I applied your suggestion of relaxing the window size a bit in the case when everything matches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to avoid collecting tiny windows of doc IDs at once, so that collectors can feel free to apply logic that has some overhead in LeafCollector#collect(DocIdStream) (e.g. https://github.com/apache/lucene/pull/14273/files#diff-05525bb5769d4251279bcd9c76d259f8eb451a16075af357e69bd98890c3db5bR257).
Good point, thanks!
This is an attempt to resurrect #12194 in a (hopefully) better way. Now that many queries run with `DenseConjunctionBulkScorer`, which scores windows of doc IDs at a time, it becomes natural to skip clauses that have long runs of matches by checking if they match the whole window. This introduces the same `DocIdSetIterator#peekNextNonMatchingDocID()` API that PR #12194 suggested, implements it in `DocIdSetIterator#all`, and uses it in `DenseConjunctionBulkScorer` to skip clauses that match the whole window. For better test coverage, `DenseConjunctionBulkScorer` was refactored to require at least one iterator, which can be a `DocIdSetIterator#all` instance if all docs match. In follow-ups, we should look into supporting other queries that are likely to have long runs of matches, in particular doc-value range queries on fields that are part of the index sort and take advantage of a doc-value skipper. Closes #11915
This is an attempt to resurrect apache#12194 in a (hopefully) better way. Now that many queries run with `DenseConjunctionBulkScorer`, which scores windows of doc IDs at a time, it becomes natural to skip clauses that have long runs of matches by checking if they match the whole window. This introduces the same `DocIdSetIterator#peekNextNonMatchingDocID()` API that PR apache#12194 suggested, implements it in `DocIdSetIterator#all`, and uses it in `DenseConjunctionBulkScorer` to skip clauses that match the whole window. For better test coverage, `DenseConjunctionBulkScorer` was refactored to require at least one iterator, which can be a `DocIdSetIterator#all` instance if all docs match. In follow-ups, we should look into supporting other queries that are likely to have long runs of matches, in particular doc-value range queries on fields that are part of the index sort and take advantage of a doc-value skipper. Closes apache#11915
This is an attempt to resurrect #12194 in a (hopefully) better way. Now that many queries run with
DenseConjunctionBulkScorer
, which scores windows of doc IDs at a time, it becomes natural to skip clauses that have long runs of matches by checking if they match the whole window.This introduces the same
DocIdSetIterator#peekNextNonMatchingDocID()
API that PR #12194 suggested, implements it inDocIdSetIterator#all
, and uses it inDenseConjunctionBulkScorer
to skip clauses that match the whole window.For better test coverage,
DenseConjunctionBulkScorer
was refactored to require at least one iterator, which can be aDocIdSetIterator#all
instance if all docs match.In follow-ups, we should look into supporting other queries that are likely to have long runs of matches, in particular doc-value range queries on fields that are part of the index sort and take advantage of a doc-value skipper.
Closes #11915