Skip to content

PointInSetQuery early exit on non-matching segments #14268

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 1, 2025

Conversation

hanbj
Copy link
Contributor

@hanbj hanbj commented Feb 21, 2025

Description

When creating a PointInSetQuery object, the data in the packedPoints parameter is returned in order, so the maximum and minimum values ​​can be determined when iterating over packedPoints.

With the maximum and minimum values, the query can be returned early in special cases during the serial search of the segment.

Copy link
Contributor

@stefanvodita stefanvodita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @hanbj! Would it make sense to add unit tests that exercise these new code paths and serve as examples for the type of situation you want to capture? Also, if I understand correctly, this is meant as an optimisation, so performance tests that show an improvement would be great!

@hanbj hanbj force-pushed the segment_clipping branch 2 times, most recently from 12c401d to 1834d4e Compare March 13, 2025 07:31
@hanbj
Copy link
Contributor Author

hanbj commented Mar 13, 2025

@stefanvodita Thank you for the review. Unit testing has been added

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Mar 28, 2025
Copy link
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This optimization makes sense to me. I left you a few small comments. Thanks for suggesting this change!

@@ -108,6 +110,8 @@ protected PointInSetQuery(String field, int numDims, int bytesPerDim, Stream pac
}
if (previous == null) {
previous = new BytesRefBuilder();
lowerPoint = new byte[bytesPerDim * numDims];
System.arraycopy(current.bytes, current.offset, lowerPoint, 0, lowerPoint.length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I think it'd be slightly more readable if you used current.length here instead of lowerPoint.length (and I might also throw an assert lowerPoint.length == current.length immediately before this line to make it clear they should be equal).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, usually the length of the copied array is used.

if (previous != null) {
BytesRef max = previous.get();
upperPoint = new byte[bytesPerDim * numDims];
System.arraycopy(max.bytes, max.offset, upperPoint, 0, upperPoint.length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: same comment here about using previous.length in place of upperPoint.length. I don't feel strongly about this so please feel free to disagree if you have a different perspective.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, usually the length of the copied array is used. I used the length of max here.

@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
return null;
}

if (values.getDocCount() == 0) {
return null;
} else if (lowerPoint != null && upperPoint != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be true that either 1) both of these are null, or 2) both are non-null? I think so right? If that's right, I would check lowerPoint != null then put an assert upperPoint != null in the condition branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been modified.

@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
return null;
}

if (values.getDocCount() == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure of this but I don't think it's possible to get back a non-null instance from reader#getPointValues that has a zero doc count. I believe you'll always get back a null instance if the points field has no docs in a segment. Can you confirm with a test and/or some debugging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am referring to the implementation in PointRangeQuery here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the git annotations in PointRangeQuery, these checks were added in two different changes. I think it's likely this was just overlooked. I do not believe it's possible to have a non-null PointValue at this point that returns a zero doc count. (I also played around with this using some unit tests and a debugger and can confirm that behavior). All that said, I'm not strongly opposed to leaving the check in there.

@github-actions github-actions bot removed the Stale label Mar 29, 2025
Copy link
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the iteration!

@@ -248,6 +255,33 @@ public long cost() {
}
}

private boolean checkValidPointValues(PointValues values) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer going back to having this logic inlined as it was. I don't think checking values == null is really part of validating the PointValues. And there's nothing else calling this, so I think it's a bit easier to read when inlined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(But I do agree we should do the validation before the optimization you introduced)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already rollback

@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
return null;
}

if (values.getDocCount() == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the git annotations in PointRangeQuery, these checks were added in two different changes. I think it's likely this was just overlooked. I do not believe it's possible to have a non-null PointValue at this point that returns a zero doc count. (I also played around with this using some unit tests and a debugger and can confirm that behavior). All that said, I'm not strongly opposed to leaving the check in there.

@@ -186,6 +186,8 @@ Optimizations

* GITHUB#14272: Use DocIdSetIterator#range for continuous-id BKD leaves. (Guo Feng)

* GITHUB#14268: PointInSetQuery clips segments by lower and upper (hanbj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move the changes entry to 10.3 since the 10.2 branch has already been cut and I don't think we need to squeeze this in with that release?

Also, can we make this entry a little more descriptive? Maybe something like PointInSetQuery optimization for the case when no segment docs can intersect with the query values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been modified.

@hanbj hanbj changed the title PointInSetQuery clips segments by lower and upper PointInSetQuery early exit on non-matching segments Apr 1, 2025
@hanbj hanbj force-pushed the segment_clipping branch from cd30b95 to 7c264ce Compare April 1, 2025 03:27
@gsmiller
Copy link
Contributor

gsmiller commented Apr 1, 2025

This looks great! Taking care of the merge now. Thank you @hanbj !

@gsmiller gsmiller merged commit dba6c2c into apache:main Apr 1, 2025
7 checks passed
@gsmiller gsmiller added this to the 10.3.0 milestone Apr 1, 2025
jainankitk pushed a commit to jainankitk/lucene that referenced this pull request Apr 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants