Skip to content

Make Lucene better at skipping long runs of matches. #14312

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Mar 10, 2025

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Feb 27, 2025

This is an attempt to resurrect #12194 in a (hopefully) better way. Now that many queries run with DenseConjunctionBulkScorer, which scores windows of doc IDs at a time, it becomes natural to skip clauses that have long runs of matches by checking if they match the whole window.

This introduces the same DocIdSetIterator#peekNextNonMatchingDocID() API that PR #12194 suggested, implements it in DocIdSetIterator#all, and uses it in DenseConjunctionBulkScorer to skip clauses that match the whole window.

For better test coverage, DenseConjunctionBulkScorer was refactored to require at least one iterator, which can be a DocIdSetIterator#all instance if all docs match.

In follow-ups, we should look into supporting other queries that are likely to have long runs of matches, in particular doc-value range queries on fields that are part of the index sort and take advantage of a doc-value skipper.

Closes #11915

This is an attempt to resurrect apache#12194 in a (hopefully) better way. Now that
many queries run with `DenseConjunctionBulkScorer`, which scores windows of doc
IDs at a time, it becomes natural to skip clauses that have long runs of
matches by checking if they match the whole window.

This introduces the same `DocIdSetIterator#peekNextNonMatchingDocID()` API that
PR apache#12194 suggested, implements it in `DocIdSetIterator#all`, and uses it in
`DenseConjunctionBulkScorer` to skip clauses that match the whole window.

For better test coverage, `DenseConjunctionBulkScorer` was refactored to
require at least one iterator, which can be a `DocIdSetIterator#all` instance
if all docs match.

In follow-ups, we should look into supporting other queries that are likely to
have long runs of matches, in particular doc-value range queries on fields that
are part of the index sort and take advantage of a doc-value skipper.

Closes apache#11915
@jpountz
Copy link
Contributor Author

jpountz commented Feb 27, 2025

cc @gf2121 who's been reviewing related PRs recently and @iverase for the connection with sparse indexing

Copy link
Contributor

@gf2121 gf2121 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jpountz , we have nextDoc for sparse docs, intoBitset for dense docs, and now we are getting this new peekNextNonMatchingDocID for sequential docs. It is exciting to see DocIdSetIterator getting smarter on various doc distributions!

}
}
// Note: iterators may be empty!
iterators = windowIterators;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lead clause might get changed in this window, iterators.get(0) need to advance windowBase again. I believe that is why CI tests get failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh right, thanks for catching. I did a large refactoring but I made sure to fix this, and added tests to DenseConjunctionBulkScorer that should catch such problems in the future.

windowIterators.clear();
for (DocIdSetIterator iterator : iterators) {
// Skip iterators that fully match the window
if (iterator.docID() > windowBase || iterator.peekNextNonMatchingDocID() < windowMax) {
Copy link
Contributor

@gf2121 gf2121 Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth passing the comparison value in, like iterator.peekNextNonMatchingDocID(windowMax), so that Implementations can reduce the number the blocks it need to check according to the threshold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit on the fence about it because it makes testing harder (I like it being declarative, like we do for e.g. impacts or positions), and I don't expect this peekNextNonMatchingDocID call to ever be a bottleneck?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect this peekNextNonMatchingDocID call to ever be a bottleneck

That makes sense, thanks for explanation!

@jpountz
Copy link
Contributor Author

jpountz commented Mar 9, 2025

I've been thinking a bit more about naming since I don't like peekNextNonMatchingDocID much, I'm thinking of renaming to docIDRunEnd (using "run" as in "run-length encoding"). I like it better because it just says that there is a run of adjacent doc IDs without implying that the next doc ID doesn't match. It's also shorter.

Copy link
Contributor

@gf2121 gf2121 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

}
}

if (minDocIDRunEnd >= bitsetWindowMax) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Immature idea:
As we computed this minDocIDRunEnd anyway, can we just collect this range and update min to it? Probably like:

if (minDocIDRunEnd > min + 1) {
  rangeDocIdStream.from = min;
  rangeDocIdStream.to = minDocIDRunEnd;
  collector.collect(rangeDocIdStream);
  min = minDocIDRunEnd;
  if (minDocIDRunEnd >= bitsetWindowMax) {
    // We have a large range of doc IDs that all match.
    return minDocIDRunEnd;
  }
}

I wouldn't expect this to help much since it only happens at the edge of the range, but it also doesn't seem to have side effects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid collecting tiny windows of doc IDs at once, so that collectors can feel free to apply logic that has some overhead in LeafCollector#collect(DocIdStream) (e.g. https://github.com/apache/lucene/pull/14273/files#diff-05525bb5769d4251279bcd9c76d259f8eb451a16075af357e69bd98890c3db5bR257). But I applied your suggestion of relaxing the window size a bit in the case when everything matches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid collecting tiny windows of doc IDs at once, so that collectors can feel free to apply logic that has some overhead in LeafCollector#collect(DocIdStream) (e.g. https://github.com/apache/lucene/pull/14273/files#diff-05525bb5769d4251279bcd9c76d259f8eb451a16075af357e69bd98890c3db5bR257).

Good point, thanks!

@jpountz jpountz merged commit fe913e5 into apache:main Mar 10, 2025
7 checks passed
@jpountz jpountz deleted the skip_long_runs_of_matches branch March 10, 2025 17:08
jpountz added a commit that referenced this pull request Mar 10, 2025
This is an attempt to resurrect #12194 in a (hopefully) better way. Now that
many queries run with `DenseConjunctionBulkScorer`, which scores windows of doc
IDs at a time, it becomes natural to skip clauses that have long runs of
matches by checking if they match the whole window.

This introduces the same `DocIdSetIterator#peekNextNonMatchingDocID()` API that
PR #12194 suggested, implements it in `DocIdSetIterator#all`, and uses it in
`DenseConjunctionBulkScorer` to skip clauses that match the whole window.

For better test coverage, `DenseConjunctionBulkScorer` was refactored to
require at least one iterator, which can be a `DocIdSetIterator#all` instance
if all docs match.

In follow-ups, we should look into supporting other queries that are likely to
have long runs of matches, in particular doc-value range queries on fields that
are part of the index sort and take advantage of a doc-value skipper.

Closes #11915
hanbj pushed a commit to hanbj/lucene that referenced this pull request Mar 12, 2025
This is an attempt to resurrect apache#12194 in a (hopefully) better way. Now that
many queries run with `DenseConjunctionBulkScorer`, which scores windows of doc
IDs at a time, it becomes natural to skip clauses that have long runs of
matches by checking if they match the whole window.

This introduces the same `DocIdSetIterator#peekNextNonMatchingDocID()` API that
PR apache#12194 suggested, implements it in `DocIdSetIterator#all`, and uses it in
`DenseConjunctionBulkScorer` to skip clauses that match the whole window.

For better test coverage, `DenseConjunctionBulkScorer` was refactored to
require at least one iterator, which can be a `DocIdSetIterator#all` instance
if all docs match.

In follow-ups, we should look into supporting other queries that are likely to
have long runs of matches, in particular doc-value range queries on fields that
are part of the index sort and take advantage of a doc-value skipper.

Closes apache#11915
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make Lucene smarter about long runs of matches
2 participants