-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Early terminate visit BKD leaf when current value greater than upper point in sorted dim. #12528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@jpountz Please take a look when you get a chance! |
@iverase I replaced int values with static variables. Please take a look. |
public static final int HIGH_IN_SORTED_DIM = 2; | ||
/** Packed value is too high in NON-SORTED dimension */ | ||
public static final int HIGH_IN_NON_SORTED_DIM = 3; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern here is that the concept of SORTED dimension does not exist in the PointValues API. If you have a look to the javadocs when visiting a leaf node:
/**
* Called for all documents in a leaf cell that crosses the query. The consumer should
* scrutinize the packedValue to decide whether to accept it. In the 1D case, values are visited
* in increasing order, and in the case of ties, in increasing docID order.
*/
It only constraints the 1D case but in higher dimensions there is no constraint how data is visited. The concept of SORTED dimension sounds to me an implementation detail that should not be leaked to the public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have concerns with this change as I have the impression that we are leaking implementation details into the public API which is not a good idea in general. Maybe other folks have a different opinion but I don't think the leaves sorted dimensions should be part of the public API, or at least we need to make it explicit before this change.
…ator, byte[]) to make other queries use DocIdSetIterator.
@iverase Does it make sense to you if MatchState defined in other class, such as BKDReader or IntersectVisitor, and only leave the sorted dimension in IntersectVisitor' visit method as a parameter? |
My recommendation is to use the following method as you are just trying to flag if the visit method needs to keep processing points.
I would remove the "inverse" case as the inverse visitors are implementation details and IMHO should not be part of the API.
We should run the performance test in the [lucene benchmarks](lucene benchmarks) to check if there are slowdowns due to the extra check for each visited point. |
@iverase Thanks for your recommendation, it may makes code more clear. |
@iverase Here is a performance test from luceneutil.
I only implemented visitWithSortedDim in PointRangeQuery (without "inverse" case) temporarily, so I just ran the box's points test.
Do you mean the off the shelf benchmark in lucene self? |
One case I am interested in is the geo benchmarks. It is not clear to me if some of those queries (e.g polygon query) can effectively take advantage of this change so it would be good to know that they are not slowed down. |
@iverase
I want to add a enhanced IntersectVisitor type or add a flag to IntersectVisitor, for queries which have already implemented visitWithSortedDim. And leave queries which haven't or couldn't implement visitWithSortedDim, use the original visit method to avoid the unnecessary checks. |
@iverase I implemented visitWithSortedDim in LatLonPointDistanceQuery. Please take a look when you get a chance! BTW, The contents of getIntersectVisitor and getInverseIntersectVisitor in LatLonPointDistanceQuery and PointRangeQuery are identical. Should we move them to a DefaultIntersectVisitor and make match/relate abstract in it. I mean in a new PR without this change. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
@iverase Or Do you have any other ideas to terminate visiting points, which already greater than |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Maybe this extra check can be weakened by Branch predictor |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
FWIW, I implemented "inverse" case with adding a |
I will try to measure the performance. |
Since values are sorted on the sorted dim in BKD leaf, early terminate the visit when current packed value greater than upper point in sorted dimension, may get a better performance.
Here are some performance data:
1 dim point
Visit method for low cardinality points:
Whole search for low cardinality points:
Whole search for high cardinality points:
2 dim point
Whole search for low cardinality points:
Whole search for high cardinality points: