Exec Resource Monitoring + OTEL Metrics support #8506
Conversation
@@ -1124,3 +1144,112 @@ func (w *Worker) installCACerts(ctx context.Context, state *execState) error {
	return nil
}

func (w *Worker) runContainer(ctx context.Context, state *execState) (rerr error) {
Note to reviewers: I refactored this step here because the cgroup monitoring needs close synchronization with the actual execution of the container process, so that it can collect a final cgroup sample before runc.Delete is called (which removes the cgroup). It seemed worth encapsulating this complexity into a setup func like we do for the rest of the executor.
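To illustrate the synchronization described above, here is a minimal sketch (not the PR's actual code) of a cgroup monitor whose stop function blocks until one final sample has been taken, so the caller can safely delete the cgroup afterwards. `sampleCgroup` is a hypothetical stand-in for the file parsing and metric recording:

```go
package main

import (
	"context"
	"time"
)

// sampleCgroup stands in for parsing files under /sys/fs/cgroup for the exec
// and recording the values on OTel instruments.
func sampleCgroup(ctx context.Context, cgroupPath string) {}

// monitorCgroup samples the cgroup on an interval. The returned stop func
// takes one last sample and only returns once it has been collected, so it is
// safe to delete the cgroup (e.g. via runc.Delete) after stop returns.
func monitorCgroup(ctx context.Context, cgroupPath string, interval time.Duration) (stop func()) {
	done := make(chan struct{})
	stopped := make(chan struct{})
	go func() {
		defer close(stopped)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				sampleCgroup(ctx, cgroupPath)
			case <-done:
				sampleCgroup(ctx, cgroupPath) // final sample while the cgroup still exists
				return
			}
		}
	}()
	return func() {
		close(done)
		<-stopped // block until the final sample has been collected
	}
}
```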
-- ) RETURNING id;
-- name: InsertMetric :one
INSERT INTO metrics (
  data
@vito I saw that spans and logs kept individual rows for a lot of the data and was gonna do the same for metrics, but there was some annoyance around writing the code for ser/deser of protobuf enums, so I just took the easy route and serialized the whole protobuf object with protojson.

It seems like we could probably get away with the same thing for spans+logs too, which might simplify some of this code a bit. I'm happy to leave everything as is since it all works, just checking with you whether there was some longer-term plan around having these individual fields in sqlite vs. just persisting the whole protobuf obj as a blob.
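For context, the "whole protobuf obj as a blob" approach looks roughly like the sketch below (illustrative only, not the PR's code): protojson handles the enum ser/deser, and sqlite just stores the resulting JSON in a single `data` column. The table/column names mirror the `InsertMetric` query above; everything else is assumed.

```go
package main

import (
	"database/sql"

	metricspb "go.opentelemetry.io/proto/otlp/metrics/v1"
	"google.golang.org/protobuf/encoding/protojson"
)

// insertMetrics serializes the entire OTLP protobuf object (enums included)
// in one call and stores it as a single blob row.
func insertMetrics(db *sql.DB, rm *metricspb.ResourceMetrics) error {
	data, err := protojson.Marshal(rm)
	if err != nil {
		return err
	}
	_, err = db.Exec(`INSERT INTO metrics (data) VALUES (?)`, data)
	return err
}

// loadMetrics deserializes a row back into the typed protobuf without any
// per-field column mapping.
func loadMetrics(row *sql.Row) (*metricspb.ResourceMetrics, error) {
	var data []byte
	if err := row.Scan(&data); err != nil {
		return nil, err
	}
	rm := new(metricspb.ResourceMetrics)
	if err := protojson.Unmarshal(data, rm); err != nil {
		return nil, err
	}
	return rm, nil
}
```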
That works - let's see how it goes. I would leave the rest as-is for now, since we've tossed around ideas for exposing OTel data through the Dagger API somehow, and it'll be easier to do things like "subscribe to this span" with a proper schema.
core/integration/telemetry_test.go
@@ -42,3 +43,27 @@ func (TelemetrySuite) TestInternalVertexes(ctx context.Context, t *testctx.T) {
		require.NotContains(t, logs.String(), "merge (")
	})
}

func (TelemetrySuite) TestMetrics(ctx context.Context, t *testctx.T) {
This test is flaky in CI, I believe just because all the metric plumbing is inherently time-based and thus sometimes slower in some test runs.

The best option is probably to rebase on #8442 and use the new telemetry test setup there.
Hrm I can't seem to get any readings out of it with this:
dagger-dev core container from --address busybox with-exec --args dd,if=/dev/random,of=foo,bs=1M,count=1234 with-exec --args sleep,10 stdout -vvv
Output:
✔ Container.withExec(args: ["dd", "if=/dev/random", "of=foo", "bs=1M", "count=1234"]): Container! 4.1s
┃ 1234+0 records in
┃ 1234+0 records out
┃ 1293942784 bytes (1.2GB) copied, 3.565858 seconds, 346.1MB/s
✔ load cache: creating dagger metadata 0.0s
✔ exec dd if=/dev/random of=foo bs=1M count=1234 4.1s
✔ Container.withExec(args: ["sleep", "10"]): Container! 14.2s
✔ exec sleep 10 10.1s
✔ Container.stdout: String! 14.2s
Approving anyway on the assumption I missed something.
}
if spans, logs, ok := enginetel.ConfiguredCloudExporters(ctx); ok {
	telemetryCfg.LiveTraceExporters = append(telemetryCfg.LiveTraceExporters, spans)
	telemetryCfg.LiveLogExporters = append(telemetryCfg.LiveLogExporters, logs)
	// TODO: metrics to cloud
👍 - yep, need to get a schema and API set up on that side first
sdk/go/telemetry/attrs.go
// OTel metric attribute so we can correlate metrics with spans
MetricsSpanID = "dagger.io/metrics.span"

// OTel metric attribute so we can correlate metrics with traces
MetricsTraceID = "dagger.io/metrics.trace"
nit, for consistency:
Suggested change:

// OTel metric attribute so we can correlate metrics with spans
- MetricsSpanID = "dagger.io/metrics.span"
+ MetricsSpanIDAttr = "dagger.io/metrics.span"

// OTel metric attribute so we can correlate metrics with traces
- MetricsTraceID = "dagger.io/metrics.trace"
+ MetricsTraceIDAttr = "dagger.io/metrics.trace"
The main benefit of this is to free up the un-`Attr`-suffixed name for a helper that sets the attribute, e.g. `telemetry.MetricsTraceID(traceID)`.
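A hypothetical sketch of what those helpers could look like (assuming `go.opentelemetry.io/otel/attribute`; not part of the PR):

```go
package telemetry

import "go.opentelemetry.io/otel/attribute"

const (
	// OTel metric attribute so we can correlate metrics with spans
	MetricsSpanIDAttr = "dagger.io/metrics.span"
	// OTel metric attribute so we can correlate metrics with traces
	MetricsTraceIDAttr = "dagger.io/metrics.trace"
)

// MetricsSpanID returns the attribute that correlates a metric with a span.
func MetricsSpanID(spanID string) attribute.KeyValue {
	return attribute.String(MetricsSpanIDAttr, spanID)
}

// MetricsTraceID returns the attribute that correlates a metric with a trace.
func MetricsTraceID(traceID string) attribute.KeyValue {
	return attribute.String(MetricsTraceIDAttr, traceID)
}
```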
sdk/go/telemetry/attrs.go
// OTel metric for number of bytes read from disk by a container, as parsed from its cgroup
IOStatDiskReadBytes = "dagger.io/metrics.iostat.disk.readbytes"

// OTel metric for number of bytes written to disk by a container, as parsed from its cgroup
IOStatDiskWriteBytes = "dagger.io/metrics.iostat.disk.writebytes"

// OTel metric for number of microseconds SOME tasks in a cgroup were stalled on IO
IOStatPressureSomeTotal = "dagger.io/metrics.iostat.pressure.some.total"

// OTel metric units should be in UCUM format
// https://unitsofmeasure.org/ucum

// Bytes unit for OTel metrics
ByteUnitName = "byte"

// Microseconds unit for OTel metrics
MicrosecondUnitName = "us"
mega nit: these aren't attributes, so it's a little jarring seeing them mixed in. If it's convenient to have them in the same file, we could rename this file to `consts.go` and group them separately? Or maybe a `metrics.go` would be nice for more targeted file switching? 🤷♂️ Just picking nits. Feel free to merge regardless, can tidy up async.
sdk/go/telemetry/transform.go
🪠
@@ -279,6 +282,7 @@ func (r renderer) renderSpan(
	// TODO: when a span has child spans that have progress, do 2-d progress
	// fe.renderVertexTasks(out, span, depth)
	r.renderDuration(out, span)
	r.renderMetrics(out, span)
Will these show up by default for everyone now? Wondering if we should tie it to a verbosity level. 🤔
edit: oh I guess since it's tied to the internal Buildkit spans they already won't see anything until cranking up the verbosity level, so 👍
Right now the metrics only show up once they are non-zero, so in your case the metrics for that exec were all 0 and thus never showed up.
To avoid making newer clients incompatible w/ older engines, we can just skip exporting metrics if the engine doesn't know about them.
This PR adds support for OTel metrics to the engine+CLI (TUI specifically) and publishes some initial metrics for exec ops, parsed from their cgroup.
Metrics are currently only shown in the TUI on `-vvv` verbosity or above.

Example of building the engine (metrics show up in green after exec has been running for at least 3 seconds or has finished):

OTel Metrics Primer
Braindump of what I've learned about OTel metrics:
- Metrics are recorded via a `Meter`, which you get from a `MeterProvider` (described more below).
  - A `Meter` is associated with a particular `Resource` and lets you create all sorts of different `Instrument` kinds for recording different types of data (link to different kinds).
  - We currently only use `Int64Gauge` since it fits the collected data so far, but we'll likely need to use more in time.
- `Meter`s have an associated `Aggregation` and `Temporality` that controls whether/how data points are aggregated/summarized before continuing down the pipeline.
  - Using `Int64Gauge` means we just publish the most recent value seen in a given sampling interval (more on that below).
  - `Aggregation`/`Temporality` in the engine
  - `Aggregation`/`Temporality` in the CLI
- Metrics end up at an `Exporter`, which is the same as the rest of OTel: it's an interface to a place you can push metrics (OTLP, etc.)
- Two more entities sit between a `Meter` and an `Exporter`: a `Reader` and `MeterProvider`.
  - The `MeterProvider` is created directly from a `Reader` and, as we currently use it, is boilerplate for adapting the two interfaces. It lets you create `Meter`s that are hooked up to the `Reader`.
    - There are also `View`s, which get associated with the `MeterProvider` and allow you to customize the metrics stream created by `Meter`s/`Instrument`s created from the `MeterProvider`.
  - The `Reader` is somewhat more meaningful. There are two implementations of it:
    - `ManualReader`: metrics will be collected and returned to the user by calling the `Collect` method on the `Reader`. The user then needs to publish them to an `Exporter` (or whatever).
    - `PeriodicReader`: metrics will be automatically collected on an interval and published to a given `Exporter` in the background.
  - We use `PeriodicReader` everywhere applicable since it's simplest. Wouldn't be surprised if we want a `ManualReader` with custom logic in the future (i.e. to optimize by exporting based on both time+buffer size).
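To tie the pieces above together, here's a minimal, self-contained sketch (not the PR's actual wiring) of the `Exporter` → `PeriodicReader` → `MeterProvider` → `Meter` → `Int64Gauge` path with the OTel Go SDK. The metric name, unit, and attribute key mirror the constants added in this PR; everything else (exporter choice, interval, values) is illustrative:

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Exporter: where the metrics get pushed (OTLP over gRPC here).
	exp, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		panic(err)
	}

	// PeriodicReader: collects from all Meters and pushes to the Exporter on
	// an interval, in the background.
	reader := sdkmetric.NewPeriodicReader(exp, sdkmetric.WithInterval(time.Second))

	// MeterProvider: mostly adapter boilerplate hooking Meters up to the Reader.
	mp := sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader))
	defer mp.Shutdown(ctx)

	// Meter + Int64Gauge: publishes the most recent value seen per interval.
	meter := mp.Meter("dagger.io/engine")
	readBytes, err := meter.Int64Gauge(
		"dagger.io/metrics.iostat.disk.readbytes",
		metric.WithUnit("byte"),
	)
	if err != nil {
		panic(err)
	}

	// Record a sample, correlated to a span via an attribute.
	readBytes.Record(ctx, 1293942784,
		metric.WithAttributes(attribute.String("dagger.io/metrics.span", "abc123")))
}
```

The engine and CLI differ in which exporter and interval they use, but the `Reader`/`MeterProvider`/`Instrument` relationship is the same.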
Cgroup based metrics
I ended up going with our own custom implementation of cgroup monitoring rather than re-using upstream's code. This allowed us to simplify and specialize it for OTel metrics integration more easily.
Monitoring the cgroup amounts to just periodically parsing some files under `/sys/fs/cgroup` for the exec and publishing them on some OTel metric `Instrument`s.

I started with disk bytes read/written and IO pressure because they map to simple `Int64Gauge`s that don't require deltas, aggregation, etc.

Adding support for more exec metrics will mostly be a matter of parsing more files.
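For illustration, a rough sketch of that parsing for cgroup v2 (not necessarily the PR's implementation): `io.stat` has per-device `rbytes=`/`wbytes=` fields, and `io.pressure` has a `some ... total=<microseconds>` line:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// diskReadWriteBytes sums rbytes/wbytes across all devices in io.stat, e.g.
// "259:0 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0".
func diskReadWriteBytes(cgroupPath string) (read, write int64, err error) {
	f, err := os.Open(filepath.Join(cgroupPath, "io.stat"))
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 {
			continue
		}
		for _, field := range fields[1:] { // skip the MAJ:MIN device id
			k, v, ok := strings.Cut(field, "=")
			if !ok {
				continue
			}
			n, _ := strconv.ParseInt(v, 10, 64)
			switch k {
			case "rbytes":
				read += n
			case "wbytes":
				write += n
			}
		}
	}
	return read, write, scanner.Err()
}

// ioPressureSomeTotal returns the "some ... total=" value from io.pressure:
// the number of microseconds SOME tasks in the cgroup were stalled on IO.
func ioPressureSomeTotal(cgroupPath string) (int64, error) {
	data, err := os.ReadFile(filepath.Join(cgroupPath, "io.pressure"))
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		// e.g. "some avg10=0.00 avg60=0.00 avg300=0.00 total=12345"
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "some" {
			continue
		}
		for _, field := range fields[1:] {
			if v, ok := strings.CutPrefix(field, "total="); ok {
				return strconv.ParseInt(v, 10, 64)
			}
		}
	}
	return 0, fmt.Errorf("no 'some' line in io.pressure")
}
```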
Misc notes:

- Disk read/write byte counts can show up as 0 if you don't call `sync` in the exec and don't use direct io for reading.

Basically working, just doing cleanup.
TODO (mostly notes for self):

- address the remaining `// TODO:`s
- metrics only show on `-vvv` now, which seems reasonable to start
- double check `transform.go` (there was a lot of copypasta from internal otel packages, may have duplicated something existing by accident)