
Exec Resource Monitoring + OTEL Metrics support #8506


Merged: 5 commits into dagger:main on Oct 18, 2024

Conversation

@sipsma (Contributor) commented on Sep 20, 2024:

This PR adds support for OTel metrics to the engine+CLI (the TUI specifically) and publishes some initial metrics for exec ops, parsed from their cgroups.

  • For now, the metrics are just total bytes read from/written to disk, plus IO pressure
  • More metrics can easily be added iteratively as desired.

Metrics are currently only shown in the TUI on -vvv verbosity or above.

Example of building the engine (metrics show up in green after exec has been running for at least 3 seconds or has finished):
(asciicast recording)

OTel Metrics Primer

Braindump of what I've learned about OTel metrics:

  1. To publish metrics, you need a Meter, which you get from a MeterProvider (described more below).
    • A Meter is associated with a particular Resource and lets you create all sorts of different Instrument kinds for recording different types of data (link to different kinds)
    • Currently, I only use an Int64Gauge since it fits the collected data so far, but we'll likely need to use more in time.
    • Usage in code
  2. Meters have an associated Aggregation and Temporality that control whether/how data points are aggregated/summarized before continuing down the pipeline.
  3. On the other side of the pipeline there's an Exporter, which works the same as in the rest of OTel: it's an interface to a destination you can push metrics to (OTLP, etc.)
  4. There are two components that connect a Meter and an Exporter: a Reader and a MeterProvider.
    • A MeterProvider is created directly from a Reader and, as we currently use it, is mostly boilerplate for adapting the two interfaces. It lets you create Meters that are hooked up to the Reader.
    • One thing I haven't used yet is Views, which get associated with the MeterProvider and let you customize the metrics stream produced by Meters/Instruments created from that MeterProvider.
    • The Reader is somewhat more meaningful. There are two implementations of it:
      • ManualReader: metrics will be collected and returned to the user by calling the Collect method on the Reader. The user then needs to publish them to an Exporter (or whatever)
      • PeriodicReader: metrics will be automatically collected on an interval and published to a given Exporter in the background
    • For now, I've just used a PeriodicReader everywhere applicable since it's simplest. Wouldn't be surprised if we want a ManualReader with custom logic in the future (e.g. to optimize by exporting based on both time and buffer size); a minimal wiring sketch follows after this list.
    • Usage in code:
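
For concreteness, here is a minimal sketch of that pipeline using the upstream Go SDK, not the PR's actual code: a stdout Exporter, a PeriodicReader on a 3-second interval, a MeterProvider, and an Int64Gauge Instrument. The meter, metric, and attribute names are placeholders, and Int64Gauge assumes a reasonably recent otel-go release.

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Exporter: the destination the metrics get pushed to (OTLP, stdout, etc.).
	exp, err := stdoutmetric.New()
	if err != nil {
		panic(err)
	}

	// PeriodicReader: collects in the background and pushes to the Exporter
	// on an interval.
	reader := sdkmetric.NewPeriodicReader(exp, sdkmetric.WithInterval(3*time.Second))

	// MeterProvider: adapts the Reader and hands out Meters; Views and a
	// Resource would also be attached here.
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader))
	defer provider.Shutdown(ctx)

	meter := provider.Meter("example/exec-metrics")

	// Int64Gauge Instrument recording the latest observed value.
	gauge, err := meter.Int64Gauge("example.disk.read_bytes", metric.WithUnit("byte"))
	if err != nil {
		panic(err)
	}

	// Record a data point, tagged with an attribute for correlation.
	gauge.Record(ctx, 4096, metric.WithAttributes(
		attribute.String("example.span_id", "0123456789abcdef"),
	))
}
```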

Useful links:

Cgroup based metrics

I ended up going with our own custom implementation of cgroup monitoring rather than re-using upstream's code, which let us simplify it and specialize it for the OTel metrics integration.

Monitoring the cgroup amounts to periodically parsing some files under /sys/fs/cgroup for the exec and publishing the parsed values on OTel metric Instruments (a rough sketch follows below).

  • The format of the files being parsed can be found in kernel docs here
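
As a rough illustration of the parsing (not the PR's implementation), summing read/write bytes from a cgroup v2 io.stat file looks roughly like this; the cgroup path and function name are hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readIOStat sums rbytes/wbytes across all devices in a cgroup v2 io.stat
// file, where each line looks like:
//   8:0 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
func readIOStat(cgroupPath string) (readBytes, writeBytes int64, err error) {
	data, err := os.ReadFile(filepath.Join(cgroupPath, "io.stat"))
	if err != nil {
		return 0, 0, err
	}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue // skip blank or malformed lines
		}
		// fields[0] is the MAJ:MIN device ID; the rest are key=value pairs.
		for _, field := range fields[1:] {
			key, val, ok := strings.Cut(field, "=")
			if !ok {
				continue
			}
			n, perr := strconv.ParseInt(val, 10, 64)
			if perr != nil {
				continue
			}
			switch key {
			case "rbytes":
				readBytes += n
			case "wbytes":
				writeBytes += n
			}
		}
	}
	return readBytes, writeBytes, nil
}

func main() {
	// Hypothetical cgroup path for an exec's container.
	r, w, err := readIOStat("/sys/fs/cgroup/some-exec")
	if err != nil {
		panic(err)
	}
	fmt.Printf("disk read=%d bytes, written=%d bytes\n", r, w)
}
```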

I started with Disk bytes read/written and IO pressure because:

  • They are simple Int64Gauges that don't require deltas, aggregation, etc.
  • They are most immediately relevant to bottlenecks we are currently investigating

Adding support for more exec metrics will mostly be a matter of parsing more files.

  • One notable exception is network usage, which instead needs to be sampled from the CNI netns interface, but otherwise it should be the same idea.

Misc notes:

  • The metrics are currently sampled (see the sketch after this list):
    • Every 3 seconds while the exec is running
    • A single time after the exec exits to get final values (or the only values if the exec lasted less than 3s)
  • I found that you will often get inconsistent or unexpectedly low disk read/write byte counts if the exec doesn't include a sync and doesn't use direct IO for reading.
    • This is because the counters only include reads/writes that actually hit the disk, not ones served by the kernel's in-memory read/write cache.
    • This is fine, but cgroups do appear to also give info on read/write caches, so we could consider aggregating if needed
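
A minimal sketch of that sampling cadence, with hypothetical names and signature rather than the PR's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sampleLoop samples the exec's cgroup on a fixed interval while it runs,
// then takes one final sample when the exec exits so short-lived execs and
// final totals are still captured.
func sampleLoop(ctx context.Context, exited <-chan struct{}, collect func()) {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			collect()
		case <-exited:
			collect() // final sample, taken before the cgroup is removed
			return
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	exited := make(chan struct{})
	go func() { time.Sleep(7 * time.Second); close(exited) }()
	sampleLoop(context.Background(), exited, func() { fmt.Println("sampled") })
}
```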

Basically working, just doing cleanup

TODO (mostly notes for self):

  • Try using span id instead of call digest (keep call digest as attr though)
  • Add tests
  • Fix plain progress
  • Fix forwarding from nested sessions (so we can get data from dagger call test)
  • Add io pressure
  • Doc a bit more
  • Remove extraneous // TODO:s
  • Cleanup commit history
  • Hide metrics behind some verbosity level (TBD which)
    • this happened as a side effect of displaying metrics alongside spans rather than calls; they only show up at -vvv now, which seems like a reasonable starting point
  • Check performance impact on CI, tune various intervals if needed
  • Double check whether any duplicate code ended up in transform.go (there was a lot of copypasta from internal otel packages, so something existing may have been duplicated by accident)
  • Fix flaky test (or remove for now pending Better errors, cached states, pending states, duration accounting, fewer spans #8442)

@@ -1124,3 +1144,112 @@ func (w *Worker) installCACerts(ctx context.Context, state *execState) error {

return nil
}

func (w *Worker) runContainer(ctx context.Context, state *execState) (rerr error) {
sipsma (Contributor Author) commented:

Note to reviewers: I refactored this step here because the cgroup monitoring needs close synchronization with the actual execution of the container process, so that it can collect a final cgroup sample before runc.Delete is called (which removes the cgroup). It seemed worth encapsulating this complexity in a setup func like we do for the rest of the executor.
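
A sketch of the ordering constraint being described, with hypothetical function names rather than the actual executor code:

```go
// runAndMonitor illustrates the required ordering: wait for the container
// process to exit, take the last cgroup sample while the cgroup still
// exists, and only then delete the container (which removes the cgroup).
func runAndMonitor(run func() error, finalSample func(), deleteContainer func() error) error {
	runErr := run()   // container process has fully exited here
	finalSample()     // cgroup files are still present at this point
	if err := deleteContainer(); err != nil { // e.g. runc delete
		return err
	}
	return runErr
}
```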

-- ) RETURNING id;
-- name: InsertMetric :one
INSERT INTO metrics (
data
sipsma (Contributor Author) commented:
@vito I saw that spans and logs kept individual rows for a lot of the data and was going to do the same for metrics, but there was some annoyance around writing the ser/deser code for protobuf enums, so I just took the easy route and serialized the whole protobuf object with protojson.

It seems like we could probably get away with the same thing for spans+logs too, which might simplify some of this code a bit. I'm happy to leave everything as is since it all works, just checking with you whether there was some longer-term plan around having these individual fields in sqlite vs. just persisting the whole protobuf obj as a blob.
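
For reference, the protojson approach described above amounts to something like this sketch; the proto type and helper name are illustrative, not the PR's code:

```go
import (
	metricspb "go.opentelemetry.io/proto/otlp/metrics/v1"
	"google.golang.org/protobuf/encoding/protojson"
)

// marshalMetrics serializes the whole OTLP metrics protobuf as JSON so it
// can be stored in a single data column, avoiding per-field columns and the
// enum ser/deser that would come with them.
func marshalMetrics(rm *metricspb.ResourceMetrics) ([]byte, error) {
	return protojson.Marshal(rm)
}
```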

@vito (Contributor) replied:

That works - let's see how it goes. I would leave the rest as-is for now, since we've tossed around ideas for exposing OTel data through the Dagger API somehow, and it'll be easier to do things like "subscribe to this span" with a proper schema.

@sipsma force-pushed the perfmon branch 2 times, most recently from 9cf2e8b to ac3e333 on October 9, 2024 21:47
@@ -42,3 +43,27 @@ func (TelemetrySuite) TestInternalVertexes(ctx context.Context, t *testctx.T) {
require.NotContains(t, logs.String(), "merge (")
})
}

func (TelemetrySuite) TestMetrics(ctx context.Context, t *testctx.T) {
sipsma (Contributor Author) commented:

This test is flaky in CI, I believe just because all the metric plumbing is inherently time-based and thus sometimes slower in some test runs.

Best option is probably to rebase on #8442 and use the new telemetry test setup there.

@sipsma force-pushed the perfmon branch 3 times, most recently from 7c0e94c to 9da716e on October 16, 2024 00:23
@sipsma marked this pull request as ready for review on October 16, 2024 00:26
@sipsma requested a review from a team as a code owner on October 16, 2024 00:26
@sipsma requested review from jedevc and vito on October 16, 2024 00:26
@sipsma force-pushed the perfmon branch 2 times, most recently from 3d70079 to bd6c4ba on October 16, 2024 00:45
@vito (Contributor) left a comment:

Hrm I can't seem to get any readings out of it with this:

dagger-dev core container from --address busybox with-exec --args dd,if=/dev/random,of=foo,bs=1M,count=1234 with-exec --args sleep,10 stdout -vvv

Output:

✔ Container.withExec(args: ["dd", "if=/dev/random", "of=foo", "bs=1M", "count=1234"]): Container! 4.1s
┃ 1234+0 records in
┃ 1234+0 records out
┃ 1293942784 bytes (1.2GB) copied, 3.565858 seconds, 346.1MB/s
  ✔ load cache: creating dagger metadata 0.0s
  ✔ exec dd if=/dev/random of=foo bs=1M count=1234 4.1s
✔ Container.withExec(args: ["sleep", "10"]): Container! 14.2s
  ✔ exec sleep 10 10.1s
✔ Container.stdout: String! 14.2s

Approving anyway on the assumption I missed something.

}
if spans, logs, ok := enginetel.ConfiguredCloudExporters(ctx); ok {
telemetryCfg.LiveTraceExporters = append(telemetryCfg.LiveTraceExporters, spans)
telemetryCfg.LiveLogExporters = append(telemetryCfg.LiveLogExporters, logs)
// TODO: metrics to cloud
Contributor:

👍 - yep, need to get a schema and API set up on that side first

Comment on lines 87 to 91
// OTel metric attribute so we can correlate metrics with spans
MetricsSpanID = "dagger.io/metrics.span"

// OTel metric attribute so we can correlate metrics with traces
MetricsTraceID = "dagger.io/metrics.trace"
Contributor:

nit, for consistency:

Suggested change, from:

// OTel metric attribute so we can correlate metrics with spans
MetricsSpanID = "dagger.io/metrics.span"
// OTel metric attribute so we can correlate metrics with traces
MetricsTraceID = "dagger.io/metrics.trace"

to:

// OTel metric attribute so we can correlate metrics with spans
MetricsSpanIDAttr = "dagger.io/metrics.span"
// OTel metric attribute so we can correlate metrics with traces
MetricsTraceIDAttr = "dagger.io/metrics.trace"

The main benefit of this is to free up the un-Attr-suffixed name for a helper that sets the attribute, e.g. telemetry.MetricsTraceID(traceID)
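
A sketch of the kind of helper being suggested, assuming the *Attr constants from the suggested change above (hypothetical, not code from the PR):

```go
import "go.opentelemetry.io/otel/attribute"

// MetricsTraceID builds the trace-correlation attribute, using the *Attr
// constant (from the suggested change above) as the key.
func MetricsTraceID(traceID string) attribute.KeyValue {
	return attribute.String(MetricsTraceIDAttr, traceID)
}
```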

Comment on lines 93 to 109
// OTel metric for number of bytes read from disk by a container, as parsed from its cgroup
IOStatDiskReadBytes = "dagger.io/metrics.iostat.disk.readbytes"

// OTel metric for number of bytes written to disk by a container, as parsed from its cgroup
IOStatDiskWriteBytes = "dagger.io/metrics.iostat.disk.writebytes"

// OTel metric for number of microseconds SOME tasks in a cgroup were stalled on IO
IOStatPressureSomeTotal = "dagger.io/metrics.iostat.pressure.some.total"

// OTel metric units should be in UCUM format
// https://unitsofmeasure.org/ucum

// Bytes unit for OTel metrics
ByteUnitName = "byte"

// Microseconds unit for OTel metrics
MicrosecondUnitName = "us"
Contributor:

mega nit: these aren't attributes, so it's a little jarring seeing them mixed in. If it's convenient to have them in the same file, we could rename this file to consts.go and group them separately? Or maybe a metrics.go would be nice for more targeted file switching? 🤷‍♂️ Just picking nits. Feel free to merge regardless, can tidy up async

Contributor:

🪠

@@ -279,6 +282,7 @@ func (r renderer) renderSpan(
// TODO: when a span has child spans that have progress, do 2-d progress
// fe.renderVertexTasks(out, span, depth)
r.renderDuration(out, span)
r.renderMetrics(out, span)
Contributor:

Will these show up by default for everyone now? Wondering if we should tie it to a verbosity level. 🤔

edit: oh I guess since it's tied to the internal Buildkit spans they already won't see anything until cranking up the verbosity level, so 👍

@sipsma (Contributor Author) commented on Oct 16, 2024:

@vito

The secret incantation is:

dagger-dev core container from --address busybox with-exec --args dd,if=/dev/random,of=foo,bs=1M,count=1234,oflag=direct with-exec --args sleep,10 stdout -vvv

The difference is oflag=direct. I hit this too and believe (like 90% sure) that what's happening is the kernel read/write caches are getting hit instead of the actual disk; oflag=direct enables direct IO so you skip the caches.

  • Also worth noting that if the exec is cached you get no data at all (as opposed to cached metrics from a previous run), so you need a cache bust

Right now the metrics only show up once they are non-zero, so in your case the metrics for that exec were all 0 and thus never showed up.

To avoid making newer clients incompatible w/ older engines, we can just
skip exporting metrics if the engine doesn't know about them.

Signed-off-by: Erik Sipsma <[email protected]>
@sipsma merged commit 1f9d373 into dagger:main on Oct 18, 2024
57 checks passed
@sipsma added this to the v0.13.6 milestone on Oct 18, 2024