Enable collectors to take advantage of pre-aggregated data. #14401

Merged: 9 commits, Mar 31, 2025

Conversation

@jpountz (Contributor) commented Mar 25, 2025

This introduces LeafCollector#collectRange, which is typically useful to take advantage of the pre-aggregated data exposed in DocValuesSkipper. At the moment, DocValuesSkipper only exposes per-block min and max values, but we could easily extend it to record sums and value counts as well.

This collectRange method would be called if there are no deletions in the segment by:

  • queries that rewrite to a MatchAllDocsQuery (with min=0 and max=maxDoc),
  • PointRangeQuery on segments that fully match the range (typical for time-based data),
  • doc-value range queries and conjunctions of doc-value range queries on fields that enable sparse indexing and correlate with the index sort.

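The semantics described above can be illustrated with a minimal, self-contained sketch. The `LeafCollectorSketch` name and the interface shape below are illustrative stand-ins, not Lucene's actual `LeafCollector` API: the point is that `collectRange(min, max)` defaults to collecting every doc in the half-open range, while specializations are free to replace the per-doc loop with pre-aggregated data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for Lucene's LeafCollector, illustrating the new
// collectRange semantics discussed in this PR.
interface LeafCollectorSketch {
  void collect(int doc);

  // Default: collecting the range [min, max) is equivalent to collecting
  // each doc individually. Implementations that have pre-aggregated data
  // (e.g. per-block sums) can override this and skip the per-doc loop.
  default void collectRange(int min, int max) {
    for (int doc = min; doc < max; doc++) {
      collect(doc);
    }
  }
}

public class CollectRangeDemo {
  public static void main(String[] args) {
    List<Integer> collected = new ArrayList<>();
    LeafCollectorSketch c = collected::add;
    c.collectRange(3, 7); // no deletions: docs 3, 4, 5, 6
    System.out.println(collected); // [3, 4, 5, 6]
  }
}
```

This also makes the no-deletions precondition concrete: the caller can only substitute a single `collectRange(min, max)` call for per-doc collection when every doc in the range is live.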
@jpountz (Contributor, Author) commented Mar 25, 2025

@epotyom You may be interested in this: it allows computing aggregates in sub-linear time relative to the number of matching docs.

@gsmiller (Contributor) commented:

It makes sense to me to expose the idea of doc range collection as a first-class API on leaf collectors for the reasons you outlined above. This would also benefit #14273, right? If there are scorers that can leverage the range collection call, it would immediately benefit, I believe.

@jpountz (Contributor, Author) commented Mar 27, 2025

> This would also benefit #14273

I don't think so, or rather taking advantage of range collection shouldn't help more than what #14273 does with RangeDocIdStream?

For clarity, this collectRange method is more useful for aggregating values (facet "recorders", to use the terminology of the new faceting framework?). Implementations would need to consult the DocValuesSkipper to check whether it has pre-aggregated data over ranges of doc IDs that are contained in the range of doc IDs passed to collectRange. These sub-ranges could be aggregated in constant time, without having to iterate over docs.
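A self-contained sketch of that idea follows. Note the hedge: DocValuesSkipper only exposes per-block min/max today, so the per-block sums below model the *proposed* extension, and `SumCollector` is a hypothetical aggregator, not a Lucene class. Blocks fully contained in the collected range contribute their pre-aggregated sum in constant time; only the partial blocks at the edges fall back to per-doc iteration.

```java
// Hypothetical aggregator using pre-aggregated per-block sums, as a
// DocValuesSkipper extended with sums could provide. Not Lucene API.
class SumCollector {
  private final long[] values;    // per-doc values for this segment
  private final int blockSize;    // block granularity of the "skipper"
  private final long[] blockSums; // pre-aggregated sum per block
  long sum;                       // running aggregate

  SumCollector(long[] values, int blockSize) {
    this.values = values;
    this.blockSize = blockSize;
    this.blockSums = new long[(values.length + blockSize - 1) / blockSize];
    for (int i = 0; i < values.length; i++) {
      blockSums[i / blockSize] += values[i];
    }
  }

  // Aggregate the half-open doc ID range [min, max).
  void collectRange(int min, int max) {
    int doc = min;
    // Leading partial block: iterate doc by doc.
    while (doc < max && doc % blockSize != 0) {
      sum += values[doc++];
    }
    // Blocks fully contained in the range: constant time per block.
    while (doc + blockSize <= max) {
      sum += blockSums[doc / blockSize];
      doc += blockSize;
    }
    // Trailing partial block.
    while (doc < max) {
      sum += values[doc++];
    }
  }
}
```

With a large range, almost all work happens in the middle loop, which is why the cost becomes sub-linear in the number of matching docs.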

private final int min, max;

RangeDocIdStream(int min, int max) {
  if (max < min) {
Review comment (Contributor):
Do we intend to allow min==max, which is actually an empty range? We need to update this or the exception message anyway.

Reply (jpountz, Contributor, Author):

This is a good question, I had overlooked it. While I don't think that empty ranges would cause issues for any implementation, I think we should reject them. I'll update the PR.

* <p>Extending this method is typically useful to take advantage of pre-aggregated data exposed
* in a {@link DocValuesSkipper}.
*
* <p>The default implementation calls {@link #collect(DocIdStream)} on a {@link DocIdStream} that
Review comment (Contributor):

Maybe clarify whether we have any guarantees on the given values, like max > min?

@gsmiller (Contributor) commented:

> I don't think so, or rather taking advantage of range collection shouldn't help more than what #14273 does with RangeDocIdStream?

My thinking here was that HistogramCollector should benefit from any scorers that can provide it with a DocIdStream for collection, and that this change lays the groundwork for more scorers to pass streams to collectors instead of individual docs (specifically thinking about some of the query use-cases you mention in the description). I probably should have been more clear :) (and maybe I'm still getting confused and this isn't true...)

@jpountz (Contributor, Author) commented Mar 28, 2025

Ah, that's right. We have a good number of queries that are already covered; in my opinion, the next natural step is to look into making ranges collect ranges when any clause would collect a range.

@jpountz (Contributor, Author) commented Mar 28, 2025

Any opinion on collect(int min, int max) vs. collectRange(int min, int max)? I leaned towards collectRange since we already have collect(int doc) and it wouldn't be obvious from the parameter types whether collect(int, int) is collecting a range or two random docs. No strong feeling either way though. collect(DocIdStream) is called "collect" rather than "collectDocIdStream" so I guess that collect(int min, int max) would be more consistent from this perspective.

@gsmiller (Contributor) commented:

I prefer collectRange as well to make usage a little less error-prone. I don't have a strong opinion though.

@gf2121 (Contributor) left a review:

LGTM

@Override
public void collectRange(int min, int max) throws IOException {
  assert min > lastCollected;
  assert max > min;
Review comment (Contributor):
Maybe assert min >= this.min and max <= this.max as well :)

jpountz added 3 commits March 31, 2025 12:23
If a bucket in the middle of the range doesn't match any docs, it would be returned with a count of zero. Better not to return it at all.
@gf2121 (Contributor) left a review:

We also need a CHANGES entry.

@gf2121 (Contributor) commented Mar 31, 2025

The change from #14421 is also included, which seems unintended?

@jpountz (Contributor, Author) commented Mar 31, 2025

It is unexpected indeed! I'll fix this and add a CHANGES entry.

@jpountz jpountz merged commit 4bda52c into apache:main Mar 31, 2025
7 checks passed
@jpountz jpountz deleted the collect_pre_aggregated_data branch March 31, 2025 15:32
jpountz added a commit that referenced this pull request Mar 31, 2025