Enable collectors to take advantage of pre-aggregated data. #14401

Merged: 9 commits, Mar 31, 2025

Conversation

@jpountz (Contributor) commented Mar 25, 2025

This introduces LeafCollector#collectRange, which is typically useful to take advantage of the pre-aggregated data exposed in DocValuesSkipper. At the moment, DocValuesSkipper only exposes per-block min and max values, but we could easily extend it to record sums and value counts as well.

This collectRange method would be called if there are no deletions in the segment by:

  • queries that rewrite to a MatchAllDocsQuery (with min=0 and max=maxDoc),
  • PointRangeQuery on segments that fully match the range (typical for time-based data),
  • doc-value range queries and conjunctions of doc-value range queries on fields that enable sparse indexing and correlate with the index sort.

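The semantics described above can be illustrated with a minimal, self-contained sketch. The `LeafCollectorSketch` name and the interface shape below are illustrative stand-ins, not Lucene's actual `LeafCollector` API: the point is that `collectRange(min, max)` defaults to collecting every doc in the half-open range, while specializations are free to replace the per-doc loop with pre-aggregated data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for Lucene's LeafCollector, illustrating the new
// collectRange semantics discussed in this PR.
interface LeafCollectorSketch {
  void collect(int doc);

  // Default: collecting the range [min, max) is equivalent to collecting
  // each doc individually. Implementations that have pre-aggregated data
  // (e.g. per-block sums) can override this and skip the per-doc loop.
  default void collectRange(int min, int max) {
    for (int doc = min; doc < max; doc++) {
      collect(doc);
    }
  }
}

public class CollectRangeDemo {
  public static void main(String[] args) {
    List<Integer> collected = new ArrayList<>();
    LeafCollectorSketch c = collected::add;
    c.collectRange(3, 7); // no deletions: docs 3, 4, 5, 6
    System.out.println(collected); // [3, 4, 5, 6]
  }
}
```

This also makes the no-deletions precondition concrete: the caller can only substitute a single `collectRange(min, max)` call for per-doc collection when every doc in the range is live.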
@jpountz (Contributor, Author) commented Mar 25, 2025

@epotyom You may be interested in this: it allows computing aggregates in sub-linear time relative to the number of matching docs.

@gsmiller (Contributor) commented:

It makes sense to me to expose the idea of doc range collection as a first-class API on leaf collectors for the reasons you outlined above. This would also benefit #14273, right? If there are scorers that can leverage the range collection call, it would immediately benefit, I believe.

@jpountz (Contributor, Author) commented Mar 27, 2025

> This would also benefit #14273

I don't think so, or rather taking advantage of range collection shouldn't help more than what #14273 does with RangeDocIdStream?

For clarity, this collectRange method is more useful for aggregating values (facet "recorders", to use the terminology of the new faceting framework?). Implementations would need to consult the DocValuesSkipper to check whether it has pre-aggregated data over ranges of doc IDs that are contained in the range of doc IDs passed to collectRange. These sub-ranges could be aggregated in constant time, without having to iterate over docs.
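A self-contained sketch of that idea follows. Note the hedge: DocValuesSkipper only exposes per-block min/max today, so the per-block sums below model the *proposed* extension, and `SumCollector` is a hypothetical aggregator, not a Lucene class. Blocks fully contained in the collected range contribute their pre-aggregated sum in constant time; only the partial blocks at the edges fall back to per-doc iteration.

```java
// Hypothetical aggregator using pre-aggregated per-block sums, as a
// DocValuesSkipper extended with sums could provide. Not Lucene API.
class SumCollector {
  private final long[] values;    // per-doc values for this segment
  private final int blockSize;    // block granularity of the "skipper"
  private final long[] blockSums; // pre-aggregated sum per block
  long sum;                       // running aggregate

  SumCollector(long[] values, int blockSize) {
    this.values = values;
    this.blockSize = blockSize;
    this.blockSums = new long[(values.length + blockSize - 1) / blockSize];
    for (int i = 0; i < values.length; i++) {
      blockSums[i / blockSize] += values[i];
    }
  }

  // Aggregate the half-open doc ID range [min, max).
  void collectRange(int min, int max) {
    int doc = min;
    // Leading partial block: iterate doc by doc.
    while (doc < max && doc % blockSize != 0) {
      sum += values[doc++];
    }
    // Blocks fully contained in the range: constant time per block.
    while (doc + blockSize <= max) {
      sum += blockSums[doc / blockSize];
      doc += blockSize;
    }
    // Trailing partial block.
    while (doc < max) {
      sum += values[doc++];
    }
  }
}
```

With a large range, almost all work happens in the middle loop, which is why the cost becomes sub-linear in the number of matching docs.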

private final int min, max;

RangeDocIdStream(int min, int max) {
  if (max < min) {
Review comment (Contributor):
Do we intend to allow min==max, which is actually an empty range? We need to update this or the exception message anyway.

Reply (jpountz, Contributor, Author):

This is a good question, I had overlooked it. While I don't think that empty ranges would cause issues for any implementation, I think we should reject them. I'll update the PR.

* <p>Extending this method is typically useful to take advantage of pre-aggregated data exposed
* in a {@link DocValuesSkipper}.
*
* <p>The default implementation calls {@link #collect(DocIdStream)} on a {@link DocIdStream} that
Review comment (Contributor):

Maybe clarify whether we have any guarantees on the given values, like max > min?

@gsmiller (Contributor) commented:

> I don't think so, or rather taking advantage of range collection shouldn't help more than what #14273 does with RangeDocIdStream?

My thinking here was that HistogramCollector should benefit from any scorers that can provide it with a DocIdStream for collection, and that this change lays the groundwork for more scorers to pass streams to collectors instead of individual docs (specifically thinking about some of the query use-cases you mention in the description). I probably should have been more clear :) (and maybe I'm still getting confused and this isn't true...)

@jpountz (Contributor, Author) commented Mar 28, 2025

Ah, that's right. We have a good number of queries that are already covered; in my opinion, the next natural step is to look into making ranges collect ranges when any clause would collect a range.

@jpountz (Contributor, Author) commented Mar 28, 2025

Any opinion on collect(int min, int max) vs. collectRange(int min, int max)? I leaned towards collectRange since we already have collect(int doc) and it wouldn't be obvious from the parameter types whether collect(int, int) is collecting a range or two random docs. No strong feeling either way though. collect(DocIdStream) is called "collect" rather than "collectDocIdStream" so I guess that collect(int min, int max) would be more consistent from this perspective.

@gsmiller (Contributor) commented:

I prefer collectRange as well to make usage a little less error-prone. I don't have a strong opinion though.

@gf2121 (Contributor) left a review:

LGTM

@Override
public void collectRange(int min, int max) throws IOException {
  assert min > lastCollected;
  assert max > min;
Review comment (Contributor):
Maybe assert min >= this.min and max <= this.max as well :)

jpountz added 3 commits March 31, 2025 12:23
If a bucket in the middle of the range doesn't match any docs, it would be returned with a count of zero. Better not to return it at all.
@gf2121 (Contributor) left a review:

We also need a CHANGES entry.

@gf2121 (Contributor) commented Mar 31, 2025

The change from #14421 is also included, which seems unintended?

@jpountz (Contributor, Author) commented Mar 31, 2025

It is unexpected indeed! I'll fix this and add a CHANGES entry.

@jpountz jpountz merged commit 4bda52c into apache:main Mar 31, 2025
7 checks passed
@jpountz jpountz deleted the collect_pre_aggregated_data branch March 31, 2025 15:32
jpountz added a commit that referenced this pull request Mar 31, 2025