Skip to content

Reduce the number of comparisons when lowerPoint is equal to upperPoint #14267

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

hanbj
Copy link
Contributor

@hanbj hanbj commented Feb 21, 2025

Description

When lowerPoint is equal to upperPoint. In fact, there is no need to compare lowerPoint and upperPoint at the same time. The number of comparisons can be reduced by half when collecting document ids to construct bitsets and deeply traversing the bkd tree.
During my reading of Elasticsearch related code, I found that when executing term or terms queries on the date field, the query is rewritten, and a single term is rewritten as a rang query (lowerTerm is the same as upperTerm). The terms query uses BooleanQuery to encapsulate multiple range queries. Therefore, it is more suitable for this scenario, reducing the number of comparisons and improving performance when collecting a large number of documents.

Copy link
Contributor

@stefanvodita stefanvodita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it's taken a while to review. Have you checked whether the failing test case is related? Do we have any performance tests?

Edit: Retry succeeded, but it's still worth checking that previous run. I haven't had a look at it yet.

@hanbj
Copy link
Contributor Author

hanbj commented Mar 12, 2025

@stefanvodita The previous failed test case was org.apache.Lucene.index TestKnnGraph.testMultiThreadedSearch.
I have confirmed the testMultiThreadedSearch method, which uses KnnFloatVectorQuery for search and does not use PointRangeQuery, so it is not relevant to this range query.

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Mar 28, 2025
Copy link
Contributor

@jainankitk jainankitk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hanbj for creating this PR. Can you make suggested refactoring to avoid any potential regression for the lowerPoint != upperPoint code path, which is also more common? Can you also add unit test for this new code path?

Comment on lines 145 to 182
if (equalValues) {
for (int dim = 0; dim < numDims; dim++, offset += bytesPerDim) {
if (comparator.compare(packedValue, offset, lowerPoint, offset) != 0) {
return false;
}
}
return true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we know about equalValues being true/false while initializing the PointRangeQuery, I would have a separate weight object, instead of having this additional logic when lowerPoint != upperPoint. For example :

 @Override
  public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
      throws IOException {
    if (this.equalValues) {
      return new ConstantScoreWeight(this, boost) {....}
    }
    // We don't use RandomAccessWeight here: it's no good to approximate with "match all docs".
    // This is an inverted structure and should be used in the first pass:
    return new ConstantScoreWeight(this, boost) {....}
 }

Copy link
Contributor Author

@hanbj hanbj Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will implement it first. Thank you for your hard work in the review.

@github-actions github-actions bot removed the Stale label Apr 1, 2025
@hanbj hanbj force-pushed the single_value_range branch 2 times, most recently from 50ded47 to 699c280 Compare April 1, 2025 10:13
Copy link
Contributor

@jainankitk jainankitk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hanbj for refactoring the code and adding the unit test. Looks much better now. Just a minor suggestion to improve the readability further.

* Essentially, it is to reduce the number of comparisons. This is an optimization, used for the
* case of lowerPoint==upperPoint.
*/
protected class SinglePointConstantScoreWeight extends MultiPointsConstantScoreWeight {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming we are reusing some of the methods from MultiPointsConstantScoreWeight. That's why we are extending from that class. May, I suggest creating class say PointRangeQueryWeight that extends from ConstantScoreWeight? And, both SinglePointRangeQueryWeight and MultiPointRangeQueryWeight extend from PointRangeQueryWeight?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion is great, SinglePointRangeQueryWeight and MultiplaPointRangeQueryWeight only need to implement their own point value matching logic and relationship judgment.

@hanbj hanbj force-pushed the single_value_range branch from 699c280 to 471179e Compare April 2, 2025 06:48
@jainankitk
Copy link
Contributor

@hanbj - Thanks for patiently addressing the review comments. While I don't see any performance regression risk myself, I am wondering if we can do one quick performance benchmark run, just to ensure we are not missing anything obvious?

Copy link
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for proposing this! I didn't review in deep detail but commented on a couple API/visibility related concerns that jumped out to me.

Also, +1 to running some benchmarks on this before merging to ensure we're not regressing current behavior. Adding this polymorphic indirection may actually hurt performance in interesting, non-obvious ways and we should verify this is actually beneficial.

As an alternative direction, I'd also be curious how PointInSetQuery with a single point performs. A really simple thing to try would be to change the query factory methods (e.g., LongPoint#newExactQuery) to build a set query instead of range if it's better. Or another option could be to create a specialized query that does an exact point comparison. Benchmarks would be a great help in figuring out what of these approaches is best.

Thanks again!

* <p>Optimize query performance by reducing the number of comparisons between dimensions. This
* implementation is used when the upper and lower bounds of all dimensions are exactly the same.
*/
protected class SinglePointRangeQueryWeight extends PointRangeQueryWeight {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually need to be protected instead of private? (Same question for MultiPointRangeQueryWeight and PointRangeQueryWeight).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, they can be private in my opinion as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While MultiPointRangeQueryWeight and SinglePointRangeQueryWeight are used for asserting the weight instance type, PointRangeQueryWeight can be made private

@@ -517,6 +623,11 @@ public byte[] getUpperPoint() {
return upperPoint.clone();
}

// for test
public boolean isEqualValues() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really prefer we do not increase our public API surface area only for testing on a class like this. Can we find another way to test without exposing this please?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @gsmiller! Can we make this package-private?

@hanbj
Copy link
Contributor Author

hanbj commented Apr 9, 2025

@gsmiller @jainankitk I haven't carefully studied the implementation of benchmark sin Lucene too, which may take sometime.

@jainankitk
Copy link
Contributor

@gsmiller @jainankitk I haven't carefully studied the implementation of benchmark sin Lucene too, which may take sometime.

@hanbj - Please take your time. You can follow the instructions here - https://github.com/mikemccand/luceneutil/blob/main/README.md. It is fairly straightforward, did it recently for another PR, coincidentally in PointRangeQuery only. Please let me know if you get stuck anywhere

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@jainankitk
Copy link
Contributor

Closing in favor of linked PR #14625 that addresses review comments with performance benchmark results

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants