Enable collectors to take advantage of pre-aggregated data. #14401
Conversation
This introduces `LeafCollector#collectRange`, which is typically useful to take advantage of the pre-aggregated data exposed in `DocValuesSkipper`. At the moment, `DocValuesSkipper` only exposes per-block min and max values, but we could easily extend it to record sums and value counts as well.

This `collectRange` method would be called if there are no deletions in the segment by:
 - queries that rewrite to a `MatchAllDocsQuery` (with min=0 and max=maxDoc),
 - `PointRangeQuery` on segments that fully match the range (typical for time-based data),
 - doc-value range queries and conjunctions of doc-value range queries on fields that enable sparse indexing and correlate with the index sort.
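To illustrate why range collection can be sub-linear in the number of matching docs, here is a minimal, self-contained sketch. It is not Lucene's actual `LeafCollector` API; the class and method names mirror the PR's vocabulary but are hypothetical stand-ins. The point is that `collectRange` updates an aggregate in constant time for a whole contiguous range instead of being called once per doc:

```java
// Hypothetical sketch, not Lucene's actual LeafCollector API.
class CountingCollector {
  private long count;

  // General case: called once per matching doc.
  void collect(int doc) {
    count++;
  }

  // Range case: the whole contiguous range [min, max) matches,
  // so the count can be updated in constant time.
  void collectRange(int min, int max) {
    if (max <= min) {
      throw new IllegalArgumentException(
          "max must be greater than min, got min=" + min + ", max=" + max);
    }
    count += max - min;
  }

  long getCount() {
    return count;
  }
}
```

With per-block sums and value counts added to `DocValuesSkipper` as the description suggests, the same shape would extend to sums and averages, not just counts.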
@epotyom You may be interested in this; it allows computing aggregates in sub-linear time with respect to the number of matching docs.
It makes sense to me to expose the idea of doc range collection as a first-class API on leaf collectors for the reasons you outlined above. This would also benefit #14273, right? If there are scorers that can leverage the range collection call, it would immediately benefit, I believe.
I don't think so, or rather taking advantage of range collection shouldn't help more than what #14273 does with

For clarity, this
```java
private final int min, max;

RangeDocIdStream(int min, int max) {
  if (max < min) {
```
Do we intend to allow `min == max`, which is actually an empty range? We need to update this or the exception message anyway.
This is a good question, I had overlooked it. While I don't think that empty ranges would cause issues for any impl, I think we should reject them. I'll update the PR.
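A self-contained sketch of what a doc-ID stream over a contiguous range could look like once empty ranges are rejected. This is illustrative only: the real `RangeDocIdStream` is a Lucene internal, and the `forEach`/`count` shape here is a stand-in for the actual `DocIdStream` contract:

```java
import java.util.function.IntConsumer;

// Illustrative sketch of a doc-ID stream over the contiguous range
// [min, max); empty ranges (max <= min) are rejected, per the review.
class RangeDocIdStream {
  private final int min, max;

  RangeDocIdStream(int min, int max) {
    if (max <= min) {
      throw new IllegalArgumentException(
          "max must be greater than min, got min=" + min + " and max=" + max);
    }
    this.min = min;
    this.max = max;
  }

  // Visit every doc ID in the range, in order.
  void forEach(IntConsumer consumer) {
    for (int doc = min; doc < max; doc++) {
      consumer.accept(doc);
    }
  }

  // The count is known without iterating: this is where a contiguous
  // range beats a general doc-ID stream.
  int count() {
    return max - min;
  }
}
```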
```java
 * <p>Extending this method is typically useful to take advantage of pre-aggregated data exposed
 * in a {@link DocValuesSkipper}.
 *
 * <p>The default implementation calls {@link #collect(DocIdStream)} on a {@link DocIdStream} that
```
Maybe clarify whether we have any guarantee on the given values, like `max > min`?
My thinking here was that
Ah, that's right. We have a good number of queries that are already covered; in my opinion the next natural step is to look into making ranges collect ranges when any clause would collect a range.
Any opinion on

I prefer
LGTM
```java
@Override
public void collectRange(int min, int max) throws IOException {
  assert min > lastCollected;
  assert max > min;
```
Maybe assert `min >= this.min` and `max <= this.max` as well :)
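A hypothetical sketch of how those extra bounds checks could look in context. The class name and structure are illustrative, not the PR's actual code, and explicit exceptions are used instead of Java `assert` so the behavior doesn't depend on running with `-ea`:

```java
// Illustrative sketch of a range-checking collector wrapper. The PR's
// actual code uses Java asserts; explicit checks are used here so the
// sketch behaves the same with or without -ea.
class CheckingRangeCollector {
  private final int min, max; // bounds this collector covers
  private int lastCollected = -1;

  CheckingRangeCollector(int min, int max) {
    this.min = min;
    this.max = max;
  }

  void collectRange(int rangeMin, int rangeMax) {
    if (rangeMin <= lastCollected) {
      throw new IllegalStateException("ranges must be collected in doc-ID order");
    }
    if (rangeMax <= rangeMin) {
      throw new IllegalArgumentException("empty or reversed range");
    }
    // The extra checks suggested in the review: the collected range
    // must stay within this collector's own bounds.
    if (rangeMin < this.min || rangeMax > this.max) {
      throw new IllegalArgumentException("range outside of collector bounds");
    }
    lastCollected = rangeMax - 1;
    // ... a real implementation would delegate to the wrapped collector here
  }

  int lastCollected() {
    return lastCollected;
  }
}
```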
If a bucket in the middle of the range doesn't match any docs, it would be returned with a count of zero. Better not to return it at all.
We also need a CHANGES entry.
The change from #14421 is also included, which seems unintended?
It is unexpected indeed! I'll fix this and add a CHANGES entry.