
Enhance message gap metric to include min/max/avg aggregations #17847


Open

jtuglu-netflix wants to merge 3 commits into base: master from upgrade-message-gap-metric

Conversation

jtuglu-netflix (Contributor) commented Mar 30, 2025

Description

Realtime ingest message gap metric additions

  • The current definition of ingest/events/messageGap remains as-is.
  • Adds ingest/events/minMessageGap, the minimum message gap seen in the currently-running task within the emission period.
  • Adds ingest/events/maxMessageGap, the maximum message gap seen in the currently-running task within the emission period.
  • Adds ingest/events/avgMessageGap, the average message gap seen in the currently-running task across the entire task duration so far.
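
As a rough illustration of the aggregation semantics above (a hypothetical sketch, not this PR's actual code; the class and field names are invented), the per-period min/max could be tracked with atomics and reset on each emission:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: min/max are tracked per emission period and reset after each emit.
class GapAggregates
{
  private final AtomicLong minMessageGap = new AtomicLong(Long.MAX_VALUE);
  private final AtomicLong maxMessageGap = new AtomicLong(Long.MIN_VALUE);

  void observe(long gapMillis)
  {
    // lock-free min/max updates; safe under concurrent reporting threads
    minMessageGap.accumulateAndGet(gapMillis, Math::min);
    maxMessageGap.accumulateAndGet(gapMillis, Math::max);
  }

  // called by the metrics monitor once per emission period
  void resetForNextPeriod()
  {
    minMessageGap.set(Long.MAX_VALUE);
    maxMessageGap.set(Long.MIN_VALUE);
  }
}

The sentinel values (Long.MAX_VALUE / Long.MIN_VALUE) double as "no data seen" markers, which matches the emission-side guard shown later in the review (minMessageGap != Long.MAX_VALUE).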

Appenderator Benchmarks

  • Adds StreamAppenderator benchmarks for the per-row Appenderator::add() method.

SegmentGenerationMetrics Benchmarks

  • Adds SegmentGenerationMetrics benchmarks for the expensive reporting methods.

Small optimizations to StreamAppenderator::add()

  • Avoid repeated System.currentTimeMillis() calls by caching the current time in a volatile field updated by a background thread (see the sketch after this list).
  • Switch to a fixed-length array for the persist-reason list (~20-40 ns per-row savings).
  • Switch to a fixed-size hydrant map allocation in persistAll() (note: this has been omitted since it appears to be slower with smaller sink counts).
  • Together, the above yield what appears to be a 100-300 µs speedup per 10k rows.
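
For illustration, a minimal sketch of the cached-clock pattern from the first bullet (the class and thread names here are hypothetical; the PR applies the same idea inside StreamAppenderator):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a background thread refreshes a volatile timestamp every
// millisecond so the hot path reads a field instead of making a time syscall.
public class CachedClock
{
  private volatile long currTimeMs = System.currentTimeMillis();

  private final ScheduledExecutorService timeExecutor =
      Executors.newSingleThreadScheduledExecutor(r -> {
        final Thread t = new Thread(r, "cached-clock");
        t.setDaemon(true);
        return t;
      });

  public CachedClock()
  {
    timeExecutor.scheduleAtFixedRate(
        () -> currTimeMs = System.currentTimeMillis(),
        0, 1, TimeUnit.MILLISECONDS
    );
  }

  public long millis()
  {
    // a volatile read; typically far cheaper than System.currentTimeMillis() in a tight loop
    return currTimeMs;
  }

  public void stop()
  {
    timeExecutor.shutdownNow();
  }
}

The trade-off is accuracy: the cached value can lag by up to the refresh period, which is acceptable for millisecond-granularity gap metrics.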

Benchmarks

  • OS: Linux
  • Arch: x86
  • Memory: 240 GB RAM
  • CPU: 32 physical cores, 2.6 GHz base frequency

Appenderator Benchmarks

NB: All benchmarks were run with 15 warm-up iterations of 10k rows each, followed by 15 measurement iterations of 10k rows.

Performance Before

Benchmark                                    (NUM_ROWS)  Mode  Cnt     Score     Error  Units
StreamAppenderatorBenchmark.benchmarkAddRow       10000  avgt   15  5760.485 ± 50.048  us/op

Performance After

Benchmark                                    (NUM_ROWS)  (enableMessageGapAggStats)  Mode  Cnt     Score    Error  Units
StreamAppenderatorBenchmark.benchmarkAddRow       10000                       false  avgt   15  5461.427 ± 40.418  us/op
StreamAppenderatorBenchmark.benchmarkAddRow       10000                        true  avgt   15  5525.679 ± 32.416  us/op

Metrics Benchmarks

Benchmark                                                                       (enableMessageGapMetrics)  Mode  Cnt  Score   Error  Units
SegmentGenerationMetricsBenchmark.benchmarkMultipleReportMaxSegmentHandoffTime                       true  avgt    2  1.696          ns/op
SegmentGenerationMetricsBenchmark.benchmarkMultipleReportMaxSegmentHandoffTime                      false  avgt    2  1.686          ns/op
SegmentGenerationMetricsBenchmark.benchmarkMultipleReportMessageGap                                  true  avgt    2  8.829          ns/op
SegmentGenerationMetricsBenchmark.benchmarkMultipleReportMessageGap                                 false  avgt    2  8.822          ns/op

Notes

  • Kept the main() functions inside the benchmarks temporarily so they're easy to run directly from an IDE.
  • Will add docs for these new metrics once the approach is approved.

Release note

Add min/max/avg message gap reporting metrics to realtime indexing jobs.


Key changed/added classes in this PR
  • benchmarks/src/test/java/org/apache/druid/benchmark/indexing/AppenderatorBenchmark.java
  • benchmarks/src/test/java/org/apache/druid/benchmark/indexing/StreamAppenderatorBenchmark.java
  • benchmarks/src/test/java/org/apache/druid/benchmark/indexing/metrics/SegmentGenerationMetricsBenchmark.java
  • indexing-service/src/main/java/org/apache/druid/indexing/common/task/IndexTask.java
  • indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java
  • indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskTuningConfig.java
  • indexing-service/src/test/java/org/apache/druid/indexing/common/TaskRealtimeMetricsMonitorTest.java
  • server/src/main/java/org/apache/druid/segment/realtime/SegmentGenerationMetrics.java
  • server/src/main/java/org/apache/druid/segment/realtime/appenderator/AppenderatorConfig.java
  • server/src/main/java/org/apache/druid/segment/realtime/appenderator/StreamAppenderator.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

jtuglu-netflix force-pushed the upgrade-message-gap-metric branch 5 times, most recently from 9964029 to 8722e82 on March 31, 2025 06:06
jtuglu-netflix force-pushed the upgrade-message-gap-metric branch 4 times, most recently from bc90805 to 4e2633a on March 31, 2025 06:55
@Setup
public void setup() throws IOException
{
  tempDir = File.createTempFile("druid-appenderator-benchmark", "tmp");

Check warning (Code scanning / CodeQL, Medium, test): Local information disclosure in a temporary directory — local information disclosure vulnerability due to use of a file readable by other local users.
jtuglu-netflix force-pushed the upgrade-message-gap-metric branch 12 times, most recently from 584a2a6 to 6c80ada on April 1, 2025 16:28
jtuglu-netflix force-pushed the upgrade-message-gap-metric branch from 6c80ada to dcd0623 on April 3, 2025 20:51
jtuglu-netflix force-pushed the upgrade-message-gap-metric branch from dcd0623 to f88b828 on April 3, 2025 21:42
jtuglu-netflix force-pushed the upgrade-message-gap-metric branch 4 times, most recently from f85c609 to ae85511 on April 14, 2025 23:09
", maxBytesInMemory=" + getMaxBytesInMemoryOrDefault() +
", skipBytesInMemoryOverheadCheck=" + isSkipBytesInMemoryOverheadCheck() +
", intermediatePersistPeriod=" + getIntermediatePersistPeriod() +
", maxPendingPersists=" + getMaxPendingPersists() +

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation — invoking SeekableStreamIndexTaskTuningConfig.getMaxPendingPersists should be avoided because it has been deprecated.
jtuglu-netflix force-pushed the upgrade-message-gap-metric branch 3 times, most recently from a06f92c to d4c8adc on April 15, 2025 06:35
jtuglu-netflix marked this pull request as ready for review April 15, 2025 06:36
jtuglu-netflix requested a review from bsyk April 15, 2025 18:54
jtuglu-netflix force-pushed the upgrade-message-gap-metric branch from d4c8adc to b20ab94 on April 16, 2025 05:50
public void reportMessageGap(final long messageGap)
{
  // incremental running mean: newAvg = oldAvg + (gap - oldAvg) / n, avoiding an unbounded sum
  final long numEvent = this.numMessageGap.incrementAndGet();
  this.avgMessageGap.getAndUpdate(oldAvg -> oldAvg + ((messageGap - oldAvg) / numEvent));
maytasm (Contributor):
Won't it be more efficient to just keep a running sum? Then only calculate the avg using numEvent when we get the average

jtuglu-netflix (Contributor, Author):
> Won't it be more efficient to just keep a running sum? Then only calculate the avg using numEvent when we get the average

In short, given the emission period is variable, we could overflow on the sum. I figured the simplest solution was taking the performance penalty of floating-point division. Other alternatives seemed a bit more complicated, like sampling/caching the gap values in a ring buffer and moving the avg calculation to emission time, etc.

Contributor:
I would agree with @maytasm on this.
Just keep an AtomicDouble totalMessageGap (to avoid the overflow problem) and the total number of events.
Do not compute the average until it needs to be reported.
maytasm added the Design Review and Area - Metrics/Event Emitting labels and removed the Area - Batch Ingestion, Kubernetes, Area - Ingestion, and Area - MSQ labels Apr 20, 2025
maytasm requested review from kfaraz and samarthjain and removed the request for bsyk April 20, 2025 01:05
github-actions bot added the Area - Batch Ingestion, Kubernetes, Area - Ingestion, and Area - MSQ labels Apr 24, 2025
kfaraz (Contributor) commented Apr 25, 2025

@maytasm, @jtuglu-netflix, thanks for the patience on this.
I will review the PR later today.

kfaraz (Contributor) left a comment:

Thanks for the changes, @jtuglu-netflix .

I feel there are several changes that are probably not needed here.

Some suggestions:

  • Move the benchmarks to a separate PR. Focus this PR only on the new metrics.
  • Do not add a new config to emit the metrics
  • Do not emit min message gap as it doesn't add any new value
  • Simplify computation of message gap

Please see the review comments for more details.


@Warmup(iterations = 15)
@Measurement(iterations = 15)
public class StreamAppenderatorBenchmark extends AppenderatorBenchmark
kfaraz (Contributor):
@jtuglu-netflix , would you mind moving the benchmarks to a separate PR and keeping this PR solely for the new metrics?

I am assuming that the benchmarks don't have anything to do with these metrics in particular. Please correct me if I am missing something.

jtuglu-netflix (Contributor, Author):
The new metrics added overhead to the loop; adding the perf improvements + benchmarks removes this overhead, so I figured I'd keep them in the same PR. I'll remove them.

kfaraz (Contributor):
Yeah, we can keep the SegmentGenerationMetricsBenchmark in this PR since it is related.

If we create a separate PR for the other benchmarks, we can review and merge that PR first.
Then we will be able to better evaluate the changes in this PR against the already added benchmarks.

It will also keep the reviews and the commit history more straightforward.

Let me know if that sounds good.

@@ -819,13 +819,13 @@ private TaskStatus generateAndPublishSegments(
     final PartitionAnalysis partitionAnalysis
 ) throws IOException, InterruptedException
 {
-  final SegmentGenerationMetrics buildSegmentsSegmentGenerationMetrics = new SegmentGenerationMetrics();
+  final IndexTuningConfig tuningConfig = ingestionSchema.getTuningConfig();
+  final SegmentGenerationMetrics buildSegmentsSegmentGenerationMetrics = new SegmentGenerationMetrics(tuningConfig.getMessageGapAggStatsEnabled());
kfaraz (Contributor):
The message gap will only ever be emitted for streaming tasks. It should not be a part of IndexTuningConfig and we should always pass false to the SegmentGenerationMetrics here.

@@ -1291,7 +1293,8 @@ public IndexTuningConfig(
     @JsonProperty("maxSavedParseExceptions") @Nullable Integer maxSavedParseExceptions,
     @JsonProperty("maxColumnsToMerge") @Nullable Integer maxColumnsToMerge,
     @JsonProperty("awaitSegmentAvailabilityTimeoutMillis") @Nullable Long awaitSegmentAvailabilityTimeoutMillis,
-    @JsonProperty("numPersistThreads") @Nullable Integer numPersistThreads
+    @JsonProperty("numPersistThreads") @Nullable Integer numPersistThreads,
+    @JsonProperty("messageGapAggStatsEnabled") @Nullable Boolean messageGapAggStatsEnabled
kfaraz (Contributor):
IndexTask and IndexTuningConfig should not have this field as it is relevant only for streaming tasks.

// cache volatile locally so it's likely to be a register read later
final long systemTime = currTimeMs;
if (messageGapAggStats) {
  metrics.reportMessageGap(systemTime - row.getTimestampFromEpoch());
}
kfaraz (Contributor):
The logic of computing the actual message gap should live inside SegmentGenerationMetrics itself. The code here should just pass the row timestamp.

timeExecutor.scheduleAtFixedRate(
    () -> currTimeMs = System.currentTimeMillis(),
    0,
    1,
    TimeUnit.MILLISECONDS
);
kfaraz (Contributor):
This doesn't seem like the right approach. We shouldn't need a separate executor just to update the currTimeMs.

jtuglu-netflix (Contributor, Author):
This was the simplest way to avoid the performance overhead of calling the time function in the main loop. On Linux, that call adds a significant per-row penalty (~24 ns per call). I indicated this in the PR description.

kfaraz (Contributor):
There are other cleaner ways to avoid invoking the System.currentTimeMillis() call.
They could be somewhat less accurate but good enough for the use case here.

Just initialize the start time once when the message gap is being reset in SegmentGenerationMetrics.
Maintain a Stopwatch.

When calling recordMessageLag, compute message gap as follows:

messageGap = startTimestamp + stopwatch.millisElapsed() - rowTimestamp

Let me know if that would work.
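
For concreteness, a rough sketch of this suggestion (shown here with Guava's Stopwatch; Druid's own Stopwatch utility, if used instead, may differ slightly, and the class name is invented):

import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: capture wall-clock time once per reset, then derive "now"
// from a monotonic stopwatch instead of calling System.currentTimeMillis() per row.
class MessageGapClock
{
  private long startTimestampMs;
  private Stopwatch stopwatch;

  // call whenever the message gap stats are reset (e.g. at the start of an emission period)
  void reset()
  {
    startTimestampMs = System.currentTimeMillis();
    stopwatch = Stopwatch.createStarted();
  }

  long messageGap(long rowTimestampMs)
  {
    // messageGap = startTimestamp + stopwatch.millisElapsed() - rowTimestamp
    return startTimestampMs + stopwatch.elapsed(TimeUnit.MILLISECONDS) - rowTimestampMs;
  }
}

Usage would be metrics-internal: reset() once per period, then messageGap(row.getTimestampFromEpoch()) inside the reporting method, so the hot path never makes a time syscall.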

// Best-effort way to ensure parity amongst emitted metrics
if (metrics.isMessageGapAggStatsEnabled()) {
  if (minMessageGap != Long.MAX_VALUE) {
    emitter.emit(builder.setMetric("ingest/events/minMessageGap", minMessageGap));
kfaraz (Contributor):
Does the min message gap really add any value?
I don't imagine any SLAs relying on the min message gap.
I think just average and max should be enough.

And since these are only two new metrics reported in every metric emission period, I think we should not need to add a new config. It is fine to emit new metrics as long as we are not changing values of any of the existing metrics.

jtuglu-netflix (Contributor, Author):
> And since these are only two new metrics reported in every metric emission period, I think we should not need to add a new config.

These new metrics add overhead to the loop; I'm not sure that's something people want to "opt in" to.

kfaraz (Contributor) commented Apr 29, 2025:
If we use the impl suggested in this comment, the overhead would be negligible, I feel.
