Fix HistogramCollector to not create zero-count buckets. #14421

jpountz · 2025-03-28T20:30:16Z

If a bucket in the middle of the range didn't match any docs, it was returned with a count of zero. It is better not to return it at all.

gsmiller

Good catch!

jainankitk · 2025-03-29T01:23:53Z

...src/test/org/apache/lucene/sandbox/facet/plain/histograms/TestHistogramCollectorManager.java

+            .add(NumericDocValuesField.newSlowRangeQuery("f", Long.MIN_VALUE, 2), Occur.SHOULD)
+            .add(NumericDocValuesField.newSlowRangeQuery("f", 10, Long.MAX_VALUE), Occur.SHOULD)
+            .build();
+    actualCounts = searcher.search(query, new HistogramCollectorManager("f", 4));


Initially, I wondered whether checkMaxBuckets(collectorCounts.size(), maxBuckets) might cause some of the earlier tests that expect an exception to fail. But all the tests with counts[i] == 0 use 1024 as maxBuckets, which is comfortably above collectorCounts.size().

Maybe we can specify 3 as maxBuckets for an even stronger condition:

searcher.search(query, new HistogramCollectorManager("f", 4, 3));
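To illustrate why a limit of 3 is a stronger condition, here is a minimal, self-contained sketch that does not use Lucene. The class name, the `histogram` helper, the matched values, and the behavior of `checkMaxBuckets` (assumed to throw when the bucket count exceeds the limit) are all hypothetical stand-ins. With a bucket width of 4, the matched values fall into buckets 0, 2, and 3, so the limit only holds if the empty bucket 1 is not created.

```java
import java.util.Map;
import java.util.TreeMap;

final class TightMaxBucketsSketch {
  // Hypothetical stand-in for the collector's limit check; assumed to throw when exceeded.
  static void checkMaxBuckets(int size, int maxBuckets) {
    if (size > maxBuckets) {
      throw new IllegalStateException("Too many buckets: " + size + " > " + maxBuckets);
    }
  }

  // Counts values into a dense array first, then copies the array into a sparse map.
  // skipEmpty toggles the guard discussed in this PR.
  static Map<Long, Integer> histogram(
      long[] values, long bucketWidth, int maxBuckets, boolean skipEmpty) {
    Map<Long, Integer> result = new TreeMap<>();
    if (values.length == 0) {
      return result;
    }
    long minBucket = Long.MAX_VALUE;
    long maxBucket = Long.MIN_VALUE;
    for (long v : values) {
      long bucket = Math.floorDiv(v, bucketWidth);
      minBucket = Math.min(minBucket, bucket);
      maxBucket = Math.max(maxBucket, bucket);
    }
    int[] counts = new int[Math.toIntExact(maxBucket - minBucket + 1)];
    for (long v : values) {
      counts[(int) (Math.floorDiv(v, bucketWidth) - minBucket)]++;
    }
    for (int i = 0; i < counts.length; ++i) {
      if (!skipEmpty || counts[i] != 0) {
        result.merge(minBucket + i, counts[i], Integer::sum);
      }
    }
    checkMaxBuckets(result.size(), maxBuckets);
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical matched values: everything <= 2 or >= 10, so bucket 1 (values 4..7) is empty.
    long[] matched = {0, 1, 2, 10, 11, 12, 15};
    System.out.println(histogram(matched, 4, 3, true)); // {0=3, 2=2, 3=2}
    try {
      histogram(matched, 4, 3, false);
    } catch (IllegalStateException e) {
      System.out.println("Without the guard: " + e.getMessage());
    }
  }
}
```

With the guard, exactly three buckets are produced and the limit of 3 is satisfied. Without it, the zero-count bucket pushes the size to 4, so a test using this limit would fail if the regression returned.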

jainankitk · 2025-03-29T01:23:59Z

...ne/sandbox/src/java/org/apache/lucene/sandbox/facet/plain/histograms/HistogramCollector.java

-        collectorCounts.addTo(leafMinBucket + i, counts[i]);
+        if (counts[i] != 0) {
+          collectorCounts.addTo(leafMinBucket + i, counts[i]);
+        }
      }
      checkMaxBuckets(collectorCounts.size(), maxBuckets);


While unrelated to this change, I am wondering if we should check eagerly to prevent unnecessary iterations:

    if (counts[i] != 0) {
      collectorCounts.addTo(leafMinBucket + i, counts[i]);
      checkMaxBuckets(collectorCounts.size(), maxBuckets);
    }

We are doing similar validation in other places:

    if (bucket != prevBucket) {
      counts.addTo(bucket, 1);
      checkMaxBuckets(counts.size(), maxBuckets);
      prevBucket = bucket;
    }

jpountz · 2025-03-29

I think it's OK the way it is. The end goal is to prevent unbounded heap allocation. In this case, the amount of excess heap we may allocate is bounded by 1024 entries, so I'd err on the side of simplicity by not checking the number of buckets in the loop?

jainankitk · 2025-03-31

Sounds fair. I was expecting a low bound like 1024; I just wanted to confirm!
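The trade-off discussed in this thread can be sketched without Lucene. In the hypothetical helper below, `entriesWhenLimitHit` reports how many map entries exist when the limit check fires. The class name, the helper, and the throwing behavior of `checkMaxBuckets` are assumptions for illustration, and a `HashMap` stands in for the collector's map. Both variants reject the same input; the deferred check just lets the map grow by at most the length of the dense array first.

```java
import java.util.HashMap;
import java.util.Map;

final class EagerVsDeferredCheckSketch {
  // Hypothetical stand-in for the collector's limit check; assumed to throw when exceeded.
  static void checkMaxBuckets(int size, int maxBuckets) {
    if (size > maxBuckets) {
      throw new IllegalStateException("Too many buckets: " + size + " > " + maxBuckets);
    }
  }

  // Returns the map size at the moment the limit check fired, or -1 if it never fired.
  static int entriesWhenLimitHit(int[] counts, int maxBuckets, boolean eager) {
    Map<Long, Integer> collectorCounts = new HashMap<>();
    try {
      for (int i = 0; i < counts.length; ++i) {
        if (counts[i] != 0) {
          collectorCounts.merge((long) i, counts[i], Integer::sum);
          if (eager) {
            // Eager variant: check after every insertion.
            checkMaxBuckets(collectorCounts.size(), maxBuckets);
          }
        }
      }
      // Deferred variant: a single check once the whole array has been copied.
      checkMaxBuckets(collectorCounts.size(), maxBuckets);
      return -1;
    } catch (IllegalStateException e) {
      return collectorCounts.size();
    }
  }

  public static void main(String[] args) {
    int[] counts = {1, 1, 1, 1, 1, 1, 1, 1};
    System.out.println(entriesWhenLimitHit(counts, 2, true)); // 3
    System.out.println(entriesWhenLimitHit(counts, 2, false)); // 8
  }
}
```

The excess in the deferred case can never exceed `counts.length`, which is why a small, fixed cap on the dense array keeps the simpler single check safe.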

jpountz · 2025-03-31T14:43:04Z

I backported this to branch_10_2 since it is a bugfix. cc @iverase

Fix HistogramCollector to not create zero-count buckets.

e5adf3c

If a bucket in the middle of the range doesn't match docs, it would be returned with a count of zero. Better not return it at all.

jpountz added this to the 10.2.0 milestone Mar 28, 2025

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Mar 28, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Mar 28, 2025

jpountz added the type:bug label Mar 28, 2025

github-actions bot added the module:sandbox label Mar 28, 2025

gsmiller approved these changes Mar 29, 2025

jainankitk approved these changes Mar 29, 2025

jpountz added 3 commits March 31, 2025 12:23

Fix HistogramCollector to not create zero-count buckets.

2df7e1f

If a bucket in the middle of the range doesn't match docs, it would be returned with a count of zero. Better not return it at all.

Merge branch 'main' into dont_create_zero_count_buckets

29abfc2

Review feedback

54f24ff

gf2121 mentioned this pull request Mar 31, 2025

Enable collectors to take advantage of pre-aggregated data. #14401

Merged

jpountz merged commit 076f4e4 into apache:main Mar 31, 2025
7 checks passed

github-project-automation bot moved this from Open to Merged in OpenSearch Lucene & Core Performance Tracking Mar 31, 2025

jpountz deleted the dont_create_zero_count_buckets branch March 31, 2025 14:41

jpountz added a commit that referenced this pull request Mar 31, 2025

Fix HistogramCollector to not create zero-count buckets. (#14421)

c321050

If a bucket in the middle of the range doesn't match docs, it would be returned with a count of zero. Better not return it at all.

jpountz added a commit that referenced this pull request Mar 31, 2025

Fix HistogramCollector to not create zero-count buckets. (#14421)

2386f64

If a bucket in the middle of the range doesn't match docs, it would be returned with a count of zero. Better not return it at all.
