Make Lucene better at skipping long runs of matches. #14312

jpountz · 2025-02-27T22:06:09Z

This is an attempt to resurrect #12194 in a (hopefully) better way. Now that many queries run with DenseConjunctionBulkScorer, which scores windows of doc IDs at a time, it becomes natural to skip clauses that have long runs of matches by checking if they match the whole window.

This introduces the same DocIdSetIterator#peekNextNonMatchingDocID() API that PR #12194 suggested, implements it in DocIdSetIterator#all, and uses it in DenseConjunctionBulkScorer to skip clauses that match the whole window.

For better test coverage, DenseConjunctionBulkScorer was refactored to require at least one iterator, which can be a DocIdSetIterator#all instance if all docs match.

In follow-ups, we should look into supporting other queries that are likely to have long runs of matches, in particular doc-value range queries on fields that are part of the index sort and take advantage of a doc-value skipper.

Closes #11915
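
To make the proposed contract concrete, here is a minimal, self-contained sketch of the idea. It is not Lucene's code: `RunEndSketch`, `RunIterator`, and the `range` helper are hypothetical stand-ins, and the conservative default (only the current doc is known to match) is an assumption made for illustration.

```java
final class RunEndSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical stand-in for an iterator over matching doc IDs; not Lucene's class. */
  interface RunIterator {
    int docID();

    int advance(int target);

    /**
     * Exclusive end of a run of consecutive matches that starts at docID().
     * Assumed conservative default: only the current doc is known to match.
     */
    default int peekNextNonMatchingDocID() {
      return docID() + 1;
    }
  }

  /** Matches every doc ID in [min, max), so a run always extends to max. */
  static RunIterator range(int min, int max) {
    return new RunIterator() {
      private int doc = -1;

      @Override
      public int docID() {
        return doc;
      }

      @Override
      public int advance(int target) {
        doc = target >= max ? NO_MORE_DOCS : Math.max(target, min);
        return doc;
      }

      @Override
      public int peekNextNonMatchingDocID() {
        // Meaningful only while positioned on a matching doc.
        return max;
      }
    };
  }

  /** Matches every doc ID in [0, maxDoc), in the spirit of DocIdSetIterator#all. */
  static RunIterator all(int maxDoc) {
    return range(0, maxDoc);
  }
}
```

A caller that scores a window [windowBase, windowMax) can then position the iterator on windowBase and compare the returned run end with windowMax to decide whether the clause needs to be evaluated at all.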


jpountz · 2025-02-27T22:07:23Z

cc @gf2121 who's been reviewing related PRs recently and @iverase for the connection with sparse indexing

gf2121

Thanks @jpountz, we have nextDoc for sparse docs, intoBitSet for dense docs, and now we are getting this new peekNextNonMatchingDocID for sequential docs. It is exciting to see DocIdSetIterator getting smarter on various doc distributions!

gf2121 · 2025-03-03T06:34:00Z

lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java

+      }
+    }
+    // Note: iterators may be empty!
+    iterators = windowIterators;


The lead clause might change in this window, so iterators.get(0) needs to be advanced to windowBase again. I believe that is why the CI tests failed.

Ahhh right, thanks for catching. I did a large refactoring but I made sure to fix this, and added tests to DenseConjunctionBulkScorer that should catch such problems in the future.

gf2121 · 2025-03-03T06:42:00Z

lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java

+    windowIterators.clear();
+    for (DocIdSetIterator iterator : iterators) {
+      // Skip iterators that fully match the window
+      if (iterator.docID() > windowBase || iterator.peekNextNonMatchingDocID() < windowMax) {


Would it be worth passing the comparison value in, like iterator.peekNextNonMatchingDocID(windowMax), so that implementations can reduce the number of blocks they need to check according to the threshold?

I'm a bit on the fence about it because it makes testing harder (I like it being declarative, like we do for e.g. impacts or positions), and I don't expect this peekNextNonMatchingDocID call to ever be a bottleneck?

I don't expect this peekNextNonMatchingDocID call to ever be a bottleneck

That makes sense, thanks for the explanation!
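
As an illustration of the testing argument above, the sketch below applies the same condition as the diff to stub clauses that simply report fixed values, which is only possible because the method takes no threshold argument. `WindowFilterSketch`, `Clause`, and `clause(...)` are hypothetical names, and the clauses are assumed to be already positioned at or after windowBase.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowFilterSketch {
  /** Hypothetical read-only view of a clause positioned at or after windowBase. */
  interface Clause {
    int docID();

    int peekNextNonMatchingDocID();
  }

  /** Stub clause that reports fixed values, convenient for tests. */
  static Clause clause(int docID, int runEnd) {
    return new Clause() {
      @Override
      public int docID() {
        return docID;
      }

      @Override
      public int peekNextNonMatchingDocID() {
        return runEnd;
      }
    };
  }

  /** Returns the clauses that still need to be evaluated in [windowBase, windowMax). */
  static List<Clause> clausesToEvaluate(List<Clause> clauses, int windowBase, int windowMax) {
    List<Clause> windowClauses = new ArrayList<>();
    for (Clause c : clauses) {
      // Keep the clause unless it sits on windowBase and its run covers the window.
      if (c.docID() > windowBase || c.peekNextNonMatchingDocID() < windowMax) {
        windowClauses.add(c);
      }
    }
    return windowClauses;
  }
}
```

With a window of [0, 4096), a stub built as `clause(0, 4096)` is dropped, while `clause(0, 100)` and `clause(7, 4096)` are kept.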

jpountz · 2025-03-09T12:32:20Z

I've been thinking a bit more about naming since I don't like peekNextNonMatchingDocID much. I'm thinking of renaming it to docIDRunEnd (using "run" as in "run-length encoding"). I like it better because it just says that there is a run of adjacent doc IDs without implying that the next doc ID doesn't match. It's also shorter.

gf2121

Nice work!

gf2121 · 2025-03-10T07:50:57Z

lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java

+        }
+      }
+
+      if (minDocIDRunEnd >= bitsetWindowMax) {


Immature idea:
As we computed this minDocIDRunEnd anyway, can we just collect this range and update min to it? Probably like:

    if (minDocIDRunEnd > min + 1) {
      rangeDocIdStream.from = min;
      rangeDocIdStream.to = minDocIDRunEnd;
      collector.collect(rangeDocIdStream);
      min = minDocIDRunEnd;
      if (minDocIDRunEnd >= bitsetWindowMax) {
        // We have a large range of doc IDs that all match.
        return minDocIDRunEnd;
      }
    }

I wouldn't expect this to help much since it only happens at the edge of the range, but it also doesn't seem to have side effects.

I would like to avoid collecting tiny windows of doc IDs at once, so that collectors can feel free to apply logic that has some overhead in LeafCollector#collect(DocIdStream) (e.g. https://github.com/apache/lucene/pull/14273/files#diff-05525bb5769d4251279bcd9c76d259f8eb451a16075af357e69bd98890c3db5bR257). But I applied your suggestion of relaxing the window size a bit in the case when everything matches.

I would like to avoid collecting tiny windows of doc IDs at once, so that collectors can feel free to apply logic that has some overhead in LeafCollector#collect(DocIdStream) (e.g. https://github.com/apache/lucene/pull/14273/files#diff-05525bb5769d4251279bcd9c76d259f8eb451a16075af357e69bd98890c3db5bR257).

Good point, thanks!
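
The trade-off discussed in this thread can be sketched as follows. This is a simplified illustration, not the merged code: `MinRunEndSketch` and `allMatchRangeEnd` are hypothetical, and clause state is passed as plain arrays instead of iterators. The fast path is taken only when the shared run covers the whole window, so the collector never receives tiny ranges.

```java
final class MinRunEndSketch {
  /**
   * Returns the exclusive end of a range starting at min that every clause
   * matches, or -1 when the fast path does not apply.
   *
   * docIDs[i] is the current doc of clause i, runEnds[i] its run end.
   */
  static int allMatchRangeEnd(int[] docIDs, int[] runEnds, int min, int windowMax) {
    if (docIDs.length == 0 || docIDs.length != runEnds.length) {
      // Mirrors the "at least one iterator" requirement from the description.
      throw new IllegalArgumentException("need at least one clause and matching array lengths");
    }
    int minRunEnd = Integer.MAX_VALUE;
    for (int i = 0; i < docIDs.length; i++) {
      if (docIDs[i] != min) {
        return -1; // This clause is not positioned on min, so no shared run starts there.
      }
      minRunEnd = Math.min(minRunEnd, runEnds[i]);
    }
    // Only report runs that cover the whole window; shorter runs go through the regular path.
    return minRunEnd >= windowMax ? minRunEnd : -1;
  }
}
```

When the method returns a value other than -1, the caller could collect [min, returned end) as one range and resume from the returned end.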


jpountz added 2 commits February 27, 2025 23:00

test

ca7637e

github-actions bot added module:core/search module:test-framework labels Feb 27, 2025

tidy

40f4042

gf2121 reviewed Mar 3, 2025

iter

b4409ae

jpountz added 2 commits March 9, 2025 21:30

Rename peekNextNonMatchingDocID -> docIDRunEnd

b20472b

Handle competitive iterator like other clauses

7db5110

gf2121 approved these changes Mar 10, 2025

jpountz added 3 commits March 10, 2025 13:27

Merge branch 'main' into skip_long_runs_of_matches

c9c3c74

Relax window size a bit when all clauses match.

f0f6918

CHANGES

55c86f3

gf2121 approved these changes Mar 10, 2025

jpountz merged commit fe913e5 into apache:main Mar 10, 2025
7 checks passed

jpountz deleted the skip_long_runs_of_matches branch March 10, 2025 17:08

jpountz mentioned this pull request Mar 26, 2025

Speed up histogram collection in a similar way as disjunction counts. #14273

Merged
