PointInSetQuery early exit on non-matching segments #14268

hanbj · 2025-02-21T06:32:00Z

Description

When a PointInSetQuery is constructed, the values in the packedPoints parameter arrive in sorted order, so the minimum and maximum query points can be recorded while iterating over packedPoints.

With these bounds, the per-segment search can return early when a segment's value range cannot overlap the range of the query points.
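To illustrate the idea in the description above, here is a minimal, self-contained sketch of the range check for one-dimensional points. It is not the code in this PR: the class `SegmentSkipSketch`, the `encode` helper, and `canSkipSegment` are hypothetical names, and the encoding only handles non-negative ints so that unsigned byte order matches numeric order.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration only; not part of Lucene.
final class SegmentSkipSketch {

  // Assumed encoding: big-endian bytes of a non-negative int, so that
  // unsigned byte comparison agrees with numeric comparison.
  static byte[] encode(int value) {
    return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
  }

  // queryLower and queryUpper are the first and last of the sorted query points.
  // Returns true when the segment's [segmentMin, segmentMax] range lies entirely
  // above or entirely below the query points, so no document can match.
  static boolean canSkipSegment(
      byte[] queryLower, byte[] queryUpper, byte[] segmentMin, byte[] segmentMax) {
    return Arrays.compareUnsigned(segmentMin, queryUpper) > 0
        || Arrays.compareUnsigned(segmentMax, queryLower) < 0;
  }
}
```

For query points between 10 and 20, a segment whose values span 30 to 40 can be skipped, while a segment spanning 12 to 18 cannot. An overlapping range does not guarantee a match, so the check only acts as a cheap pre-filter before the normal search.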

stefanvodita

Thank you for your contribution @hanbj! Would it make sense to add unit tests that exercise these new code paths and serve as examples for the type of situation you want to capture? Also, if I understand correctly, this is meant as an optimisation, so performance tests that show an improvement would be great!

hanbj · 2025-03-13T09:16:48Z

@stefanvodita Thank you for the review. Unit tests have been added.

github-actions · 2025-03-28T00:24:29Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

gsmiller

This optimization makes sense to me. I left you a few small comments. Thanks for suggesting this change!

gsmiller · 2025-03-28T14:42:45Z

lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java

@@ -108,6 +110,8 @@ protected PointInSetQuery(String field, int numDims, int bytesPerDim, Stream pac
      }
      if (previous == null) {
        previous = new BytesRefBuilder();
+        lowerPoint = new byte[bytesPerDim * numDims];
+        System.arraycopy(current.bytes, current.offset, lowerPoint, 0, lowerPoint.length);


minor: I think it'd be slightly more readable if you used current.length here instead of lowerPoint.length (and I might also throw an assert lowerPoint.length == current.length immediately before this line to make it clear they should be equal).

You're right, usually the length of the copied array is used.
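As a rough sketch of the suggestion in this thread, the copy can take its length from the source and assert that it matches the destination. The class `PointCopySketch` and the method `copyPoint` are hypothetical, and plain `bytes`/`offset`/`length` parameters stand in for the fields of `BytesRef`.

```java
// Hypothetical illustration only; not part of Lucene.
final class PointCopySketch {

  // bytes, offset, and length play the role of current.bytes, current.offset,
  // and current.length in the diff above.
  static byte[] copyPoint(byte[] bytes, int offset, int length, int numDims, int bytesPerDim) {
    byte[] point = new byte[bytesPerDim * numDims];
    // Make the expectation explicit: a packed point is exactly numDims * bytesPerDim bytes.
    assert point.length == length : "unexpected packed point length";
    // Copy using the source length, as suggested in the review.
    System.arraycopy(bytes, offset, point, 0, length);
    return point;
  }
}
```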

gsmiller · 2025-03-28T14:45:33Z

lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java

+    if (previous != null) {
+      BytesRef max = previous.get();
+      upperPoint = new byte[bytesPerDim * numDims];
+      System.arraycopy(max.bytes, max.offset, upperPoint, 0, upperPoint.length);


minor: same comment here about using previous.length in place of upperPoint.length. I don't feel strongly about this so please feel free to disagree if you have a different perspective.

You're right, usually the length of the copied array is used. I used the length of max here.

gsmiller · 2025-03-28T14:46:15Z

lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java

@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
          return null;
        }

+        if (values.getDocCount() == 0) {
+          return null;
+        } else if (lowerPoint != null && upperPoint != null) {


Should it be true that either 1) both of these are null, or 2) both are non-null? I think so, right? If that's right, I would check lowerPoint != null and then put an assert upperPoint != null inside the condition branch.

This has been modified.
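The invariant discussed in this thread can be sketched as follows. This is an illustration under assumptions, not the merged code: `QueryBoundsSketch` and `hasBounds` are hypothetical names, and a `List<byte[]>` of already sorted points stands in for the packedPoints stream.

```java
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class QueryBoundsSketch {
  final byte[] lowerPoint;
  final byte[] upperPoint;

  // Both bounds are set together from the first and last sorted point,
  // or both stay null when there are no points at all.
  QueryBoundsSketch(List<byte[]> sortedPoints) {
    if (sortedPoints.isEmpty()) {
      lowerPoint = null;
      upperPoint = null;
    } else {
      lowerPoint = sortedPoints.get(0).clone();
      upperPoint = sortedPoints.get(sortedPoints.size() - 1).clone();
    }
  }

  // Checking one bound is enough; the assert documents that the other must follow.
  boolean hasBounds() {
    if (lowerPoint != null) {
      assert upperPoint != null : "upperPoint must be set whenever lowerPoint is";
      return true;
    }
    return false;
  }
}
```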

gsmiller · 2025-03-28T14:49:44Z

lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java

@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
          return null;
        }

+        if (values.getDocCount() == 0) {


I'm not 100% sure of this but I don't think it's possible to get back a non-null instance from reader#getPointValues that has a zero doc count. I believe you'll always get back a null instance if the points field has no docs in a segment. Can you confirm with a test and/or some debugging?

I am referring to the implementation in PointRangeQuery here.

Looking at the git annotations in PointRangeQuery, these checks were added in two different changes. I think it's likely this was just overlooked. I do not believe it's possible to have a non-null PointValue at this point that returns a zero doc count. (I also played around with this using some unit tests and a debugger and can confirm that behavior). All that said, I'm not strongly opposed to leaving the check in there.

gsmiller

Thanks for the iteration!

gsmiller · 2025-03-31T16:33:09Z

lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java

@@ -248,6 +255,33 @@ public long cost() {
        }
      }

+      private boolean checkValidPointValues(PointValues values) throws IOException {


I'd prefer going back to having this logic inlined as it was. I don't think checking values == null is really part of validating the PointValues. And there's nothing else calling this, so I think it's a bit easier to read when inlined.

(But I do agree we should do the validation before the optimization you introduced)

Already rolled back.

gsmiller · 2025-03-31T16:36:13Z

lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java

@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
          return null;
        }

+        if (values.getDocCount() == 0) {


Looking at the git annotations in PointRangeQuery, these checks were added in two different changes. I think it's likely this was just overlooked. I do not believe it's possible to have a non-null PointValue at this point that returns a zero doc count. (I also played around with this using some unit tests and a debugger and can confirm that behavior). All that said, I'm not strongly opposed to leaving the check in there.

gsmiller · 2025-03-31T16:40:26Z

lucene/CHANGES.txt

@@ -186,6 +186,8 @@ Optimizations

 * GITHUB#14272: Use DocIdSetIterator#range for continuous-id BKD leaves. (Guo Feng)

+* GITHUB#14268: PointInSetQuery clips segments by lower and upper (hanbj)


Let's move the changes entry to 10.3 since the 10.2 branch has already been cut and I don't think we need to squeeze this in with that release?

Also, can we make this entry a little more descriptive? Maybe something like PointInSetQuery optimization for the case when no segment docs can intersect with the query values?

This has been modified.

gsmiller · 2025-04-01T14:51:24Z

This looks great! Taking care of the merge now. Thank you @hanbj !

github-actions bot added the module:core/search label Feb 21, 2025

stefanvodita reviewed Mar 7, 2025

hanbj force-pushed the segment_clipping branch 2 times, most recently from 12c401d to 1834d4e Compare March 13, 2025 07:31

github-actions bot added the Stale label Mar 28, 2025

gsmiller reviewed Mar 28, 2025

github-actions bot removed the Stale label Mar 29, 2025

gsmiller reviewed Mar 31, 2025

hanbj changed the title ~~PointInSetQuery clips segments by lower and upper~~ PointInSetQuery early exit on non-matching segments Apr 1, 2025

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Apr 1, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Apr 1, 2025

hanbj added 3 commits April 1, 2025 11:15

PointInSetQuery clips segments by lower and upper

3cfa94a

add a test and a CHANGES

bc0211b

code format and changes

7c264ce

hanbj force-pushed the segment_clipping branch from cd30b95 to 7c264ce Compare April 1, 2025 03:27

gsmiller merged commit dba6c2c into apache:main Apr 1, 2025
7 checks passed

github-project-automation bot moved this from Open to Merged in OpenSearch Lucene & Core Performance Tracking Apr 1, 2025

gsmiller pushed a commit that referenced this pull request Apr 1, 2025

PointInSetQuery early exit on non-matching segments (#14268)

0b1e222

gsmiller added this to the 10.3.0 milestone Apr 1, 2025

jainankitk pushed a commit to jainankitk/lucene that referenced this pull request Apr 28, 2025

PointInSetQuery early exit on non-matching segments (apache#14268)

50bfd06
