-
Notifications
You must be signed in to change notification settings - Fork 1.1k
PointInSetQuery early exit on non-matching segments #14268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution @hanbj! Would it make sense to add unit tests that exercise these new code paths and serve as examples for the type of situation you want to capture? Also, if I understand correctly, this is meant as an optimisation, so performance tests that show an improvement would be great!
12c401d
to
1834d4e
Compare
@stefanvodita Thank you for the review. Unit testing has been added |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This optimization makes sense to me. I left you a few small comments. Thanks for suggesting this change!
@@ -108,6 +110,8 @@ protected PointInSetQuery(String field, int numDims, int bytesPerDim, Stream pac | |||
} | |||
if (previous == null) { | |||
previous = new BytesRefBuilder(); | |||
lowerPoint = new byte[bytesPerDim * numDims]; | |||
System.arraycopy(current.bytes, current.offset, lowerPoint, 0, lowerPoint.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: I think it'd be slightly more readable if you used current.length
here instead of lowerPoint.length
(and I might also throw an assert lowerPoint.length == current.length
immediately before this line to make it clear they should be equal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, usually the length of the copied array is used.
if (previous != null) { | ||
BytesRef max = previous.get(); | ||
upperPoint = new byte[bytesPerDim * numDims]; | ||
System.arraycopy(max.bytes, max.offset, upperPoint, 0, upperPoint.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: same comment here about using previous.length
in place of upperPoint.length
. I don't feel strongly about this so please feel free to disagree if you have a different perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, usually the length of the copied array is used. I used the length of max here.
@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti | |||
return null; | |||
} | |||
|
|||
if (values.getDocCount() == 0) { | |||
return null; | |||
} else if (lowerPoint != null && upperPoint != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be true that either 1) both of these are null, or 2) both are non-null? I think so right? If that's right, I would check lowerPoint != null
then put an assert upperPoint != null
in the condition branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been modified.
@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti | |||
return null; | |||
} | |||
|
|||
if (values.getDocCount() == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure of this but I don't think it's possible to get back a non-null instance from reader#getPointValues
that has a zero doc count. I believe you'll always get back a null instance if the points field has no docs in a segment. Can you confirm with a test and/or some debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am referring to the implementation in PointRangeQuery here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the git annotations in PointRangeQuery
, these checks were added in two different changes. I think it's likely this was just overlooked. I do not believe it's possible to have a non-null PointValue
at this point that returns a zero doc count. (I also played around with this using some unit tests and a debugger and can confirm that behavior). All that said, I'm not strongly opposed to leaving the check in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the iteration!
@@ -248,6 +255,33 @@ public long cost() { | |||
} | |||
} | |||
|
|||
private boolean checkValidPointValues(PointValues values) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer going back to having this logic inlined as it was. I don't think checking values == null
is really part of validating the PointValues
. And there's nothing else calling this, so I think it's a bit easier to read when inlined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(But I do agree we should do the validation before the optimization you introduced)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already rollback
@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti | |||
return null; | |||
} | |||
|
|||
if (values.getDocCount() == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the git annotations in PointRangeQuery
, these checks were added in two different changes. I think it's likely this was just overlooked. I do not believe it's possible to have a non-null PointValue
at this point that returns a zero doc count. (I also played around with this using some unit tests and a debugger and can confirm that behavior). All that said, I'm not strongly opposed to leaving the check in there.
lucene/CHANGES.txt
Outdated
@@ -186,6 +186,8 @@ Optimizations | |||
|
|||
* GITHUB#14272: Use DocIdSetIterator#range for continuous-id BKD leaves. (Guo Feng) | |||
|
|||
* GITHUB#14268: PointInSetQuery clips segments by lower and upper (hanbj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move the changes entry to 10.3 since the 10.2 branch has already been cut and I don't think we need to squeeze this in with that release?
Also, can we make this entry a little more descriptive? Maybe something like PointInSetQuery optimization for the case when no segment docs can intersect with the query values
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been modified.
This looks great! Taking care of the merge now. Thank you @hanbj ! |
Description
When creating a PointInSetQuery object, the data in the packedPoints parameter is returned in order, so the maximum and minimum values can be determined when iterating over packedPoints.
With the maximum and minimum values, the query can be returned early in special cases during the serial search of the segment.