Order Metrics in the Buffer by Timestamp by Creating an Aggregator Plugin or Other Solution #12963
Comments
Hi,
This reads like you have newer metrics (i.e., ones that showed up later in the buffer) with older timestamps. Is that accurate? I'm trying to understand how we could test this scenario.
When you say out of order, do you mean out of order with respect to the timestamp on the metric?
This is what I would expect. Namely, newer data may get in because data that is ready to write does not need to go to the buffer. The data gets written to the output since the output is alive again; see here for cases that cause us to kick off a write. Then data in the buffer sits and waits, but because newer data was already written it gets stuck. I am not sure how this proposal prevents this situation. Let me know your thoughts! This issue has definitely come up before, but the solution is still not clear to me.
What I did was install Telegraf from brew, add the [inputs.cpu] input plugin with a standard boilerplate agent configuration, and add the [outputs.stackdriver] output plugin with Google auth configured via an env variable. I let the agent run for a while and validated that metrics were being sent into the Google API (in our case it was our Google Managed Prometheus project, but this seems to impact the Google Monitoring API as we use the same stackdriver plugin). Then I disconnected my network cable, let the buffer fill to about 100, and then plugged it back in. From there I started getting tons of errors in the logs complaining about metrics being out of order. I think you could perform the same test.
Timestamp. Though I'm not positive on that; I'm not sure how else the API could tell what order the metrics are supposed to be in. Yeah, I'm not sure of the exact answer, but maybe instead of an aggregator the metrics should be sorted by the plugin before they are sent. I guess the purpose of this was to make the stackdriver output plugin more robust, since any scenario that leads to metrics being out of order (network interruption, an input plugin flushing metrics out of order, etc.) makes this plugin seem unreliable in a production scenario.
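For illustration, here is a minimal sketch of the sort being suggested, assuming the output's Write method receives the batch as a []telegraf.Metric and that telegraf.Metric exposes Time() (it does in recent versions); this is not the plugin's actual code:

```go
package example

import (
	"sort"

	"github.com/influxdata/telegraf"
)

// sortOldestFirst orders a batch oldest-to-newest by metric timestamp so that
// the request built from it never places a newer point ahead of an older one.
func sortOldestFirst(batch []telegraf.Metric) {
	sort.Slice(batch, func(i, j int) bool {
		return batch[i].Time().Before(batch[j].Time())
	})
}
```

Sorting alone would not address the duplicate-series problem that comes up later in this thread, but it is the piece either an aggregator or the output itself would need.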
I am going to need a way to reproduce this. I tried digging into this last night and this morning, thinking that if it were a larger telegraf issue I could reproduce it with an exec/execd input and an http output. From those experiments I always saw metrics ordered oldest to newest, even after an interruption to the output. I did set up stackdriver and used the following config to reproduce your CPU + unplugged-network-cable example, and it recovered as expected:
[agent]
omit_hostname = true
debug = true
[[inputs.cpu]]
[[outputs.stackdriver]]
project = "flawless-span-322118"
namespace = "telegraf" and recovered just fine: ❯ make telegraf && ./telegraf --config config.toml
CGO_ENABLED=0 go build -tags "" -ldflags " -X github.com/influxdata/telegraf/internal.Commit=7088ec13 -X github.com/influxdata/telegraf/internal.Branch=master -X github.com/influxdata/telegraf/internal.Version=1.27.0-7088ec13" ./cmd/telegraf
2023-03-29T15:01:28Z I! Loading config file: config.toml
2023-03-29T15:01:28Z I! Starting Telegraf 1.27.0-7088ec13
2023-03-29T15:01:28Z I! Available plugins: 235 inputs, 9 aggregators, 27 processors, 22 parsers, 57 outputs, 2 secret-stores
2023-03-29T15:01:28Z I! Loaded inputs: cpu
2023-03-29T15:01:28Z I! Loaded aggregators:
2023-03-29T15:01:28Z I! Loaded processors:
2023-03-29T15:01:28Z I! Loaded secretstores:
2023-03-29T15:01:28Z I! Loaded outputs: stackdriver
2023-03-29T15:01:28Z I! Tags enabled:
2023-03-29T15:01:28Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2023-03-29T15:01:28Z D! [agent] Initializing plugins
2023-03-29T15:01:28Z D! [agent] Connecting outputs
2023-03-29T15:01:28Z D! [agent] Attempting connection to [outputs.stackdriver]
2023-03-29T15:01:28Z D! [agent] Successfully connected to outputs.stackdriver
2023-03-29T15:01:28Z D! [agent] Starting service inputs
2023-03-29T15:01:38Z D! [outputs.stackdriver] Buffer fullness: 0 / 10000 metrics
2023-03-29T15:01:48Z D! [outputs.stackdriver] Wrote batch of 33 metrics in 498.948597ms
2023-03-29T15:01:48Z D! [outputs.stackdriver] Buffer fullness: 0 / 10000 metrics
2023-03-29T15:01:58Z D! [outputs.stackdriver] Wrote batch of 33 metrics in 246.858905ms
2023-03-29T15:01:58Z D! [outputs.stackdriver] Buffer fullness: 0 / 10000 metrics
2023-03-29T15:02:08Z D! [outputs.stackdriver] Wrote batch of 33 metrics in 272.504395ms
2023-03-29T15:02:08Z D! [outputs.stackdriver] Buffer fullness: 0 / 10000 metrics
2023-03-29T15:02:28Z W! [agent] ["outputs.stackdriver"] did not complete within its flush interval
2023-03-29T15:02:28Z D! [outputs.stackdriver] Buffer fullness: 66 / 10000 metrics
2023-03-29T15:02:30Z E! [outputs.stackdriver] Unable to write to Stackdriver: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2023-03-29T15:02:30Z D! [outputs.stackdriver] Buffer fullness: 99 / 10000 metrics
2023-03-29T15:02:30Z E! [agent] Error writing to outputs.stackdriver: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2023-03-29T15:02:48Z W! [agent] ["outputs.stackdriver"] did not complete within its flush interval
2023-03-29T15:02:48Z D! [outputs.stackdriver] Buffer fullness: 132 / 10000 metrics
2023-03-29T15:02:50Z E! [outputs.stackdriver] Unable to write to Stackdriver: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2023-03-29T15:02:50Z D! [outputs.stackdriver] Buffer fullness: 165 / 10000 metrics
2023-03-29T15:02:50Z E! [agent] Error writing to outputs.stackdriver: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2023-03-29T15:03:08Z W! [agent] ["outputs.stackdriver"] did not complete within its flush interval
2023-03-29T15:03:08Z D! [outputs.stackdriver] Buffer fullness: 198 / 10000 metrics
2023-03-29T15:03:10Z E! [outputs.stackdriver] Unable to write to Stackdriver: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2023-03-29T15:03:10Z D! [outputs.stackdriver] Buffer fullness: 231 / 10000 metrics
2023-03-29T15:03:10Z E! [agent] Error writing to outputs.stackdriver: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2023-03-29T15:03:19Z D! [outputs.stackdriver] Wrote batch of 231 metrics in 1.612288182s
2023-03-29T15:03:19Z D! [outputs.stackdriver] Buffer fullness: 0 / 10000 metrics
2023-03-29T15:03:28Z D! [outputs.stackdriver] Wrote batch of 33 metrics in 259.043784ms
2023-03-29T15:03:28Z D! [outputs.stackdriver] Buffer fullness: 0 / 10000 metrics
2023-03-29T15:03:38Z D! [outputs.stackdriver] Wrote batch of 33 metrics in 245.288613ms
2023-03-29T15:03:38Z D! [outputs.stackdriver] Buffer fullness: 0 / 10000 metrics
^C2023-03-29T15:03:40Z D! [agent] Stopping service inputs
2023-03-29T15:03:40Z D! [agent] Input channel closed
2023-03-29T15:03:40Z I! [agent] Hang on, flushing any cached metrics before shutdown
2023-03-29T15:03:41Z D! [outputs.stackdriver] Wrote batch of 33 metrics in 245.228011ms
2023-03-29T15:03:41Z D! [outputs.stackdriver] Buffer fullness: 0 / 10000 metrics
2023-03-29T15:03:41Z I! [agent] Stopping running outputs
2023-03-29T15:03:41Z D! [agent] Stopped Successfully
Are you certain you have a single telegraf writing to that bucket? Are there any stackdriver options I need to be aware of?
Unfortunately, no. I have two differences from your config:
Scrolling to the end of the error message, this does not sound like out-of-order metrics, as the start of the error message says. It sounds like we are attempting to write more than one point to a time series in the same request. It makes some sense that you only see this when the buffer grows, because that is when you might have multiple points for the same series to write. This was briefly mentioned in #5404 when trying to reduce the number of requests from one request per metric to batched requests.
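To make that constraint concrete, here is a sketch using plain strings in place of the real monitoring API types: a series is identified by its metric type and labels (the monitored resource is left out here), and a request may contain at most one point per series, so whenever a key would repeat the batch has to be split into another request. The function names are illustrative, not the plugin's:

```go
package example

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds an identity for a time series from its metric type and
// labels. Two entries with the same key are the same series as far as the
// API is concerned.
func seriesKey(metricType string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names) // label order must not affect the key

	var b strings.Builder
	b.WriteString(metricType)
	for _, name := range names {
		fmt.Fprintf(&b, "|%s=%s", name, labels[name])
	}
	return b.String()
}

// splitBySeries partitions an ordered list of series keys into chunks such
// that no chunk repeats a key, since the API accepts at most one point per
// time series per request. Each chunk becomes its own request.
func splitBySeries(keys []string, maxPerRequest int) [][]string {
	var chunks [][]string
	current := []string{}
	seen := map[string]bool{}

	flush := func() {
		if len(current) > 0 {
			chunks = append(chunks, current)
			current = []string{}
			seen = map[string]bool{}
		}
	}

	for _, k := range keys {
		if seen[k] || len(current) == maxPerRequest {
			flush()
		}
		seen[k] = true
		current = append(current, k)
	}
	flush()
	return chunks
}
```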
I've played with this some more this afternoon. It looks like the call to [...] What I'd like to do is create a debug telegraf that prints out the time series and the values, to see if we can spot any duplicates like the error says occur when you push. I'll try to get something up tomorrow.
I put up PR #12994 with debug output. If you could try it on your Mac, where you are able to reproduce this easily, and give me the full logs and output, I'd like to look through them. I am not sure why CircleCI is not putting artifacts on PRs, but you can download the macOS artifacts here:
Thanks!
Hi, this is the output when Telegraf sent metrics out to Stackdriver without any network interruption:
When I disconnected the ethernet cable providing network access, let the buffer fill to 110, and then plugged the cable back in (pasting it without formatting as it will be a couple of really long lines):
2023-03-30T21:12:59Z D! [outputs.stackdriver] Buffer fullness: 110 / 10000 metrics
As a side note, I saw this error using the inputs.cpu plugin that I didn't get in the brew-installed version of Telegraf:
Not super relevant to this issue, but thought I would let you know.
This is because the brew folks build telegraf with CGO enabled. This pulls in the ability to monitor CPU usage on darwin/macOS systems.
Can I get the entire logs, from start to finish? You can upload a file or even mail it to me (jpowers at influxdata) if you don't want to post it.
Hi @powersj, I emailed you the logs. Thanks for the help!
Here we go :)
Write Failure 1
At timestamp
Which then has a total of 130 messages similar to the following:
In the 130 additional messages, only the
Even with all of these errors, 70 data points were written successfully.
Duplicate TimeSeries
What is a duplicate TimeSeries in Stackdriver? From what I gather it is based on the metric type and labels. Looking at the 2nd timeseries telegraf generated I see the type
If I look through the "sending time series:" debug output I see not one, but three of this same timeseries in the attempted send:
Looking through the errors I see:
Meaning the first data point was written, but then when Stackdriver got to the 2nd and 3rd, the writes failed. The only difference between these points is the end time and possibly the value itself, though not in this case. Also note that the time in seconds is newest for the first one, which brings us to....
Write Failure 2
During the next flush attempt the error message starts off differently. The below error message has additional whitespace changes from me to make it easier to read:
Then the message also contains another 131 messages about duplicate time series, similar to the first write failure.
Points must be written in order
From the first write failure telegraf already saw that points are not written oldest to newest. The payload telegraf sends is a mix of timestamps as well, meaning later attempts can fail since newer data is present. To make matters worse, telegraf does not toss the points that wrote successfully. Instead all data points that were used stay in the buffer on a failure.
Next Steps
Currently telegraf builds the timeseries, grabs up to 200 of them, and sends them over in a request. This logic needs to get a bit smarter. My hypothesis is that telegraf needs to group the timeseries requests by end time and sort oldest to newest. This would:
The duplication is avoided because if two metrics in telegraf have the same name, tags, and timestamp, then it is already the same metric. In the logs the only other difference was time, which telegraf is now grouping on. I tried sending duplicate points myself and saw the output correctly consolidate two duplicate metrics down to one timeseries. The con of this approach is that it will require additional requests, but I think that is the cost a user has to pay versus losing data or never sending data. The worst case would be getting individual metrics each with a unique timestamp; however, I think telegraf would still want to send them in different requests. How to do this, I'm not clear on quite yet. At first I saw that a timeseries takes a slice of points, but the docs say it must contain only one point, so that should still work. The other behavior I think we need to sort out is what to do on errors. Because we already check for [...] Thoughts?
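Here is a rough sketch of that hypothesis, with a stand-in struct instead of the real monitoring API types (all names here are illustrative): bucket the built points by end time, order the buckets oldest-to-newest, and send each bucket as its own request.

```go
package example

import (
	"sort"
	"time"
)

// point stands in for a built Stackdriver time series value; only the end
// time matters for the grouping.
type point struct {
	seriesKey string
	endTime   time.Time
	value     float64
}

// groupByEndTime buckets points by their end time and returns the buckets
// ordered oldest-to-newest, so each bucket can go out as its own request and
// no request mixes timestamps.
func groupByEndTime(points []point) [][]point {
	buckets := map[int64][]point{}
	for _, p := range points {
		ns := p.endTime.UnixNano()
		buckets[ns] = append(buckets[ns], p)
	}

	keys := make([]int64, 0, len(buckets))
	for ns := range buckets {
		keys = append(keys, ns)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	ordered := make([][]point, 0, len(keys))
	for _, ns := range keys {
		ordered = append(ordered, buckets[ns])
	}
	return ordered
}
```

The trade-off mentioned above shows up directly: each distinct end time becomes its own request, which in the worst case means one request per point.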
Argh, I forgot to comment here again... @crflanigan I pushed another commit to #12994 to try to group time series by time. Would you be willing to give that another shot?
@powersj |
Yes please!
@powersj |
I will want the logs, please - I'd like to go and see what happens with the change. I'm not entirely convinced this fixes the issue, but I want to see how it behaves.
Sure @powersj, I will have to re-run it so it tees to a file, but for now it looks like a similar or the same result:
Hi @powersj, |
@crflanigan I have updated the PR #12994 again, could you give that another try and let me know how it goes? If you see failures, getting all the logs as you have done before would be most helpful. Thanks!
@powersj |
In a perfect world, metrics come into stackdriver grouped by timestamp and all the timestamps are the same. This means that metrics go out together and a user never has to worry about out-of-order metrics. The world, however, is not perfect. In the event that the connection to stackdriver goes down, telegraf will start to save metrics to a buffer. Once reconnected, telegraf will hand the stackdriver output a batch of data to send. The stackdriver output would then ensure the metrics are sorted, but then break the metrics into timeseries. In doing so, metrics would no longer be in any order and telegraf could send duplicate time series, meaning two identical metrics but with different timestamps. What the user would see is first a message about a duplicate timeseries and then an error about out-of-order metrics, where an older metric was trying to get added. This change avoids mixing timestamps by batching metrics by time. This way we avoid duplicate timeseries in the first place and ensure we always send oldest to newest.
Fixes: influxdata#12963
Use Case
Certain data sinks, such as Google Stackdriver, require metrics to be emitted in order. If even a single metric comes in that is newer than the others, all metrics older than that newer metric will be rejected by the Stackdriver API. This is reiterated in the outputs.stackdriver plugin documentation:
The Stackdriver API does not allow writing points which are out of order, older than 24 hours, or with a resolution greater than one point per minute. Since Telegraf writes the newest points first and moves backwards through the metric buffer, it may not be possible to write historical data after an interruption.
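To make that constraint concrete, here is a minimal sketch of the rule the API effectively applies per time series; the types and names are illustrative, not the monitoring API's:

```go
package example

import "time"

// lastWritten tracks, per series key, the end time of the newest point that
// has been accepted so far.
type lastWritten map[string]time.Time

// accept reports whether a point with the given end time would be accepted
// for the series, and records it if so. Anything not strictly newer than the
// last accepted point is rejected as out of order.
func (lw lastWritten) accept(seriesKey string, endTime time.Time) bool {
	if prev, ok := lw[seriesKey]; ok && !endTime.After(prev) {
		return false
	}
	lw[seriesKey] = endTime
	return true
}
```

Once a newer point for a series has been written during recovery, every older point for that series still sitting in Telegraf's buffer fails this check.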
There was a feature request that was supposed to help with this by leveraging FIFO (first in, first out), but the issue persists. After discussing more with the InfluxData developers over Slack, I found that the buffer keeps the metrics in the order they are received, and that the PR changed things so that the first received metric is also the first sent to the output, but it doesn't account for timestamps. This means that if the metrics in the buffer are out of order for any reason, all older metrics could be rejected by the Stackdriver API.
Scenario:
I configure the outputs.stackdriver plugin to send metrics to the Stackdriver API successfully. I then disconnect my ethernet cable from my network adaptor to simulate a network outage. Metrics begin to fill the buffer as expected. I then connect the cable again.
Expected behavior
When I reconnect the cable, the buffer begins draining the metrics to outputs.stackdriver, which sends the metrics to the Stackdriver API without errors.
Actual behavior
When I reconnect the cable, the buffer fails to drain and actually grows, while spamming this message and similar ones over and over again:
That said, some metrics do appear to still be coming in, but many aren't, as the buffer continues to grow quickly.
Additional info
Proposed Solution:
To address this issue, an aggregator plugin could be developed that sorts metrics by their timestamps before sending them to the buffer. This would ensure that metrics are always in chronological order, preventing older metrics from being rejected by the Stackdriver API due to out-of-order timestamps. The aggregator plugin should perform the following steps:
This could also be resolved by changing the outputs.stackdriver output plugin to sort the metrics prior to being sent.
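For illustration, here is a minimal sketch of the aggregator variant, assuming the current telegraf.Aggregator interface (SampleConfig/Add/Push/Reset) and re-emitting metrics with their original timestamps through the accumulator. The plugin name sort_by_time is hypothetical, not an existing plugin:

```go
package sortbytime

import (
	"sort"

	"github.com/influxdata/telegraf"
	"github.com/influxdata/telegraf/plugins/aggregators"
)

// SortByTime buffers metrics for one aggregation period and re-emits them
// ordered oldest-to-newest, so the output sees timestamps in order.
type SortByTime struct {
	cache []telegraf.Metric
}

func (*SortByTime) SampleConfig() string { return "" }

// Add collects every metric seen during the period.
func (s *SortByTime) Add(in telegraf.Metric) {
	s.cache = append(s.cache, in)
}

// Push emits the collected metrics sorted by their original timestamps.
func (s *SortByTime) Push(acc telegraf.Accumulator) {
	sort.Slice(s.cache, func(i, j int) bool {
		return s.cache[i].Time().Before(s.cache[j].Time())
	})
	for _, m := range s.cache {
		acc.AddFields(m.Name(), m.Fields(), m.Tags(), m.Time())
	}
}

// Reset clears the cache at the end of each period.
func (s *SortByTime) Reset() {
	s.cache = nil
}

func init() {
	aggregators.Add("sort_by_time", func() telegraf.Aggregator {
		return &SortByTime{}
	})
}
```

In a real configuration this would likely need its period tuned to cover the flush interval and drop_original = true so the unsorted originals are not also passed through; and as the discussion above shows, sorting before the buffer still may not help once a failed write leaves older points behind newer ones, which is why the grouping change in #12994 targets the output side.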