enhancement(datadog_metrics sink): rewrite to the new model + add sketch support #9178
Note to self: right now the internal aggregated histograms we ship do not look right in the DD metrics UI. We should be sending absolute (monotonically increasing, at that) bucket values, but currently the bucket values fluctuate a bit, as if they're not being interpreted as gauges correctly, or the data itself actually is going up and down.
```diff
@@ -440,8 +444,8 @@
 sources-apache_metrics = []
 sources-aws_ecs_metrics = []
 sources-aws_kinesis_firehose = ["base64", "infer", "sources-utils-tls", "warp", "codecs"]
 sources-aws_s3 = ["rusoto", "rusoto_s3", "rusoto_sqs", "semver", "uuid", "codecs", "zstd"]
-sources-datadog = ["snap", "sources-utils-tls", "warp", "sources-utils-http-error", "codecs"]
 sources-dnstap = ["base64", "data-encoding", "trust-dns-proto", "dnsmsg-parser", "tonic-build", "prost-build"]
+sources-datadog_agent = ["snap", "sources-utils-tls", "warp", "sources-utils-http-error", "protobuf-build", "codecs"]
```
Why add this? Because we'll need the proto to be compiled when we accept sketches from the agent, so just doing it now with the knowledge we're going to need it anyways.
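For context, compiling a proto at build time typically happens in a Cargo build script. Below is a minimal, hedged sketch of that idea using `prost-build`; the proto file name and paths are assumptions for illustration, not Vector's actual layout:

```rust
// build.rs -- minimal sketch. Assumes a prost-build workflow and an agent
// sketch .proto vendored at the (assumed) path below.
fn main() {
    prost_build::compile_protos(
        &["proto/dd_sketch_full.proto"], // assumed proto file location
        &["proto/"],                     // include path for imports
    )
    .expect("failed to compile protobuf definitions");
}
```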
```rust
            })
        }
        MetricValue::Sketch { sketch } => {
            let quantiles = [0.5, 0.75, 0.9, 0.99]
```
These are the same quantiles supported by Datadog when you enable percentile aggregation for a given distribution, which seemed like a reasonable thing to follow.
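As a hedged sketch of how a sketch might be flattened into discrete series at those quantiles (the `AgentDDSketch` stub, its `quantile` accessor, and the series-naming scheme are all assumptions for illustration):

```rust
// Stand-in for the sketch type; only the quantile accessor matters here.
struct AgentDDSketch { /* bins, counts, ... */ }

impl AgentDDSketch {
    // Hypothetical accessor: estimated value at quantile `q` (0.0..=1.0),
    // or None if the sketch is empty.
    fn quantile(&self, _q: f64) -> Option<f64> {
        unimplemented!()
    }
}

// Flatten one sketch into discrete (series name, value) pairs.
fn sketch_to_series(name: &str, sketch: &AgentDDSketch) -> Vec<(String, f64)> {
    [0.5, 0.75, 0.9, 0.99]
        .iter()
        .filter_map(|&q| {
            sketch
                .quantile(q)
                .map(|value| (format!("{}.{}percentile", name, (q * 100.0) as u32), value))
        })
        .collect()
}
```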
This is, on the whole, excellent. I have mostly small nits, and my "request changes" is based solely on the property test comment.

Well done.
enhancement(datadog_metrics sink): rewrite to the new model + add sketch support (#9178)

* enhancement(datadog_metrics sink): massive rewrite of the sink

Some of the included changes:

- support for aggregated summaries (quantiles as discrete series)
- support for aggregated histograms (via conversion to sketches)
- native sketch support in Vector, and for sending to DD!
- sink rewritten in the "new" style
- many other core changes to support the work: metric-specific sink builder extension methods, incremental request builder, etc.
## Context

When support was added for encoding/sending sketches in #9178, logic was added to handle "splitting" payloads if a metric exceeded the (un)compressed payload limits. As we lacked (at the time) the ability to encode sketch metrics one by one, we were forced to collect all of them and then attempt to encode them all at once, which had a tendency to grow the payload past the (un)compressed limits. This "splitting" mechanism allowed us to compensate for that.

However, in order to avoid getting stuck in pathological loops where payloads were too big and thus required multiple splits (after already attempting at least one split), the logic was configured such that a batch of metrics would only be split once; if the two resulting slices couldn't be encoded without also exceeding the limits, they would be dropped and we would give up trying to split further. Despite the gut feeling during that work that it should be exceedingly rare to ever need to split further, real life has shown otherwise: #13175

## Solution

This PR introduces proper incremental encoding of sketches, which doesn't eliminate the possibility of needing to split (more below) but reduces the likelihood of splitting to a purely theoretical level. We're taking advantage of hidden-from-docs methods in `prost` to encode each `SketchPayload` object and append the bytes into a single buffer. This is possible due to how Protocol Buffers works.

Additionally, we're now generating "file descriptors" for our compiled Protocol Buffers definitions. We use this to programmatically query the field number of the "sketches" field in the `SketchPayload` message, which is slightly more robust than hardcoding it and hoping it never changes.

In Protocol Buffers, each field in a message is written out such that the field data is preceded by the field number. This is part and parcel of its ability to allow backwards-compatible changes to a definition. Further, for repeated fields -- i.e. `Vec<Sketch>` -- the repetition is expressed simply by writing the same field multiple times, rather than needing to write everything together. Practically speaking, this means that we can encode a vector of two messages, or encode those two messages individually, and end up with the same encoded output of `[field N][field data][field N][field data]`.

### Ancillary changes

We've also fixed a bug with the "bytes sent" metric reported by the `datadog_metrics` sink, caused by some very tangled and miswired code around how compressed/uncompressed/event byte sizes were shuttled from the request builder logic down to `Driver`. We've also reworked some of the encoder error types to clean them up and simplify things a bit.

## Reviewer notes

### Still needing to handle splits

The encoder still needs to care about splits, in a theoretical sense, because while we can accurately track and avoid ever exceeding the uncompressed payload limit, we can't know the final compressed payload size until we finalize the builder/payload. Currently, the encoder checks whether adding the current metric would cause us to exceed the compressed payload limit, assuming the compressor couldn't compress the encoded metric at all. This is a fairly robust check, since it tries to optimally account for the overhead of an entirely incompressible payload, and so on... but we really want to avoid dropping events if possible, obviously, and that's why the splitting code is still in place.
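To make the repeated-field property concrete, here is a small self-contained demonstration, using toy stand-ins rather than the real `Sketch`/`SketchPayload` definitions, that encoding messages together or separately yields identical bytes:

```rust
use prost::Message;

// Toy stand-ins for the real Sketch/SketchPayload messages.
#[derive(Clone, PartialEq, Message)]
struct Sketch {
    #[prost(string, tag = "1")]
    metric: String,
}

#[derive(Clone, PartialEq, Message)]
struct SketchPayload {
    #[prost(message, repeated, tag = "1")]
    sketches: Vec<Sketch>,
}

fn main() {
    let a = Sketch { metric: "a".into() };
    let b = Sketch { metric: "b".into() };

    // Encode both sketches in a single payload...
    let together = SketchPayload { sketches: vec![a.clone(), b.clone()] }.encode_to_vec();

    // ...or encode two single-sketch payloads and concatenate the buffers.
    let mut split = SketchPayload { sketches: vec![a] }.encode_to_vec();
    split.extend(SketchPayload { sketches: vec![b] }.encode_to_vec());

    // Identical bytes: each repeated element is written as its own
    // `[field N][field data]` record, so concatenation composes.
    assert_eq!(together, split);
}
```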
This PR is an amalgamation of a few key desired outcomes that are all inextricably tied together:

- rewriting the sink in the new style
- implementing a Datadog Agent-compatible version of DDSketch
- supporting sketches natively in Vector's metric model
- sending aggregated histograms (as sketches) to Datadog

All of that, and more, is occurring in this PR. I've listed the changes below in a section-by-section format, as the changes are many.
## Refactored to the new model for sinks

This one is self-explanatory. Prior to this PR, the sink was written in the old style, using `BatchedHttpSink` and all of that old machinery. No longer. We've also reworked a bit of the module/import path structure to better extricate the Datadog sinks into their own module paths and feature flags, which matches other components.

## Implemented a version of DDSketch based on the Datadog Agent
As part of wanting to make sure we faithfully represent the data sent to us by the Datadog Agent, we wanted our DDSketch implementation to interoperate. As there is no official (yet!) Rust crate for DDSketch that mirrors the specific implementation details of the version in the Datadog Agent (namely, quirks specific to Go, and the fact that it diverged from the open-source versions), I decided to port it over in its entirety.

It mirrors the agent's API, which was designed primarily for batch insertion, with support for interpolating histogram buckets. This is what unlocks being able to use it to convert an aggregated histogram into a sketch. Additionally, we have fairly comprehensive unit tests to assert that it matches the behavior of the Datadog Agent, such that any sketch coming from the agent through Vector should be emitted identically (so long as nothing modifies it in the pipeline, obviously).
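A minimal sketch of what that unlocks, assuming a hypothetical `insert_interpolate_bucket` method and a simplified lower bound for the first bucket; the real API and bucket handling in the port may differ:

```rust
// Aggregated-histogram bucket: `count` observations at or below `upper_limit`.
struct Bucket {
    upper_limit: f64,
    count: u32,
}

struct AgentDDSketch { /* bins, counts, ... */ }

impl AgentDDSketch {
    // Hypothetical: spread `count` values across (lower, upper] so the
    // sketch bins approximate the bucket's mass.
    fn insert_interpolate_bucket(&mut self, _lower: f64, _upper: f64, _count: u32) {
        unimplemented!()
    }
}

// Convert an aggregated histogram into a sketch, bucket by bucket.
// Simplification: the first bucket's lower bound is taken to be 0.0.
fn histogram_to_sketch(buckets: &[Bucket]) -> AgentDDSketch {
    let mut sketch = AgentDDSketch {};
    let mut lower = 0.0;
    for bucket in buckets {
        sketch.insert_interpolate_bucket(lower, bucket.upper_limit, bucket.count);
        lower = bucket.upper_limit;
    }
    sketch
}
```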
I had a lot of "fun" implementing this. Learning about the nitty-gritty of Go's operator precedence as I ported over `math.RoundToEven` was a particular high point. I assume you'll read all the code comments I left and see just how much "fun" I had. :)

## Added support for sketches natively in `Metric`
Since we want to be able to shuttle sketches around natively and in a lossless fashion, we have to be able to represent them internally as such... which necessitated adding a new variant to `MetricValue` to support them. Currently, we only support the `AgentDDSketch` variant, but this may be expanded or changed in the future to support more sketch types, or to support them generically by specifying the layout of the sketch.

OpenTelemetry takes an approach like this, as all sketches are effectively representable as histograms with variable-boundary buckets and a set of values that define the bucket layout. This would be simpler than explicitly caring about one sketch implementation vs. another, but means having to add enough support to be able to interoperate between the various options. We'll pay attention to how OpenTelemetry proceeds here so that we can better align ourselves in the future.
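Illustratively, the shape is along these lines; the `Sketch` variant and `AgentDDSketch` names appear in this PR, while the wrapper enum and everything elided are assumptions:

```rust
struct AgentDDSketch { /* bins, counts, ... */ }

// Wrapper so additional sketch implementations can be added later without
// changing the `MetricValue` variant itself (assumed shape).
enum MetricSketch {
    AgentDDSketch(AgentDDSketch),
}

enum MetricValue {
    // ...existing variants (Counter, Gauge, Distribution, ...) elided...
    Sketch { sketch: MetricSketch },
}
```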
## Add support for sending aggregated histograms to Datadog

After doing the above two things, we can now send aggregated histograms to Datadog by virtue of converting them to sketches first. We additionally added logic to convert distributions to sketches, as this is what the agent itself does. There's no "sketch" type at the agent level, since it uses sketches for distributions... but since we might be shuttling these native data representations internally, we have both a distribution and a sketch type.
... and now some additional bits of information that aren't terribly germane to the desired outcomes:
## New `IncrementalRequestBuilder` for more flexible request building

While `RequestBuilder` provides a simple interface for defining how to build requests from a batch of events, so long as it's configured with a codec and compression type, what it does not easily allow for is defining constraints around request building, such as ensuring a given payload does not exceed a certain number of bytes. Datadog specifically has limits around request payload sizes when interacting with its API, which the Datadog Agent handles by doing some advanced request building: incrementally encoding individual series/sketches until it hits the limit, then continuing on with building a new request, and so on, until all inputs are encoded as a number of requests.

We have done roughly the same thing here. The builder itself handles more of the heavy lifting, essentially being given all of the events in `encode_events_incrementally` and expected to give back a vector of payloads that can be fed one-by-one to the `build_request` method.

We've written a tailored encoder for this PR (`DatadogMetricsEncoder`) that is designed around incremental encoding. As the process for this sink is not yet able to be done entirely incrementally, I wasn't able to raise the builder and encoder to a level where they could be reused by other components, nor provide a default implementation that handles the boilerplate. This work is left as a future improvement.
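In rough outline, the shape is something like the sketch below; treat the associated types and signatures as assumptions rather than Vector's definitive trait:

```rust
// Incremental request building: consume all input up front, emit as many
// size-constrained payloads as needed, then turn each into a request.
trait IncrementalRequestBuilder<Input> {
    type Metadata;
    type Payload;
    type Request;
    type Error;

    // Encode everything, splitting into multiple payloads as limits require.
    fn encode_events_incrementally(
        &mut self,
        input: Input,
    ) -> Vec<Result<(Self::Metadata, Self::Payload), Self::Error>>;

    // Build a sendable request from one encoded payload.
    fn build_request(
        &mut self,
        metadata: Self::Metadata,
        payload: Self::Payload,
    ) -> Self::Request;
}
```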
## Made `RetryAction` support copy-on-write values

This one touches a few non-Datadog sinks, but the premise is simple: it used to want owned `String` values, but in many cases the values are static strings. There's no point in allocating if we don't have to, as retries themselves already do a ton of allocating that we want to get rid of.
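The gist, as a hedged sketch (variant and helper names assumed for illustration):

```rust
use std::borrow::Cow;

enum RetryAction {
    Retry(Cow<'static, str>),
    DontRetry(Cow<'static, str>),
    Successful,
}

fn classify(status: u16) -> RetryAction {
    match status {
        200..=299 => RetryAction::Successful,
        // Static reason: borrowed, zero allocation on the hot retry path.
        429 | 500..=599 => RetryAction::Retry(Cow::Borrowed("retriable status")),
        // Dynamic reason: allocate only when we actually have to.
        _ => RetryAction::DontRetry(Cow::Owned(format!("unexpected status: {}", status))),
    }
}
```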
## Small change to `AggregatedSummary`: no more `upper_limit`

Summaries don't have buckets, so using the term "upper limit" does not make sense. They do have various quantiles, though, where something like the shorthand `q` does make sense. I went ahead and made that change, which involved a few changes and, notably, is a breaking one, as it affects the output of `metric_to_log`. I added that to the upgrade guide.
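For illustration, the rename in hedged form (field set assumed; only the `upper_limit`-to-`q` change comes from the description above):

```rust
// One quantile of an aggregated summary. `q` (e.g. 0.99) replaces the old,
// misleading `upper_limit` name, since summaries have quantiles, not buckets.
struct Quantile {
    q: f64,
    value: f64,
}
```

Correspondingly, the `metric_to_log` output for summaries now carries `q` where it previously carried `upper_limit`, which is the breaking change covered in the upgrade guide.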