
Exec Resource Monitoring + OTEL Metrics support #8506


Merged: 5 commits into dagger:main on Oct 18, 2024

Conversation

@sipsma (Contributor) commented on Sep 20, 2024:

This PR adds support for OTel metrics to the engine+CLI (the TUI specifically) and publishes some initial metrics for exec ops, parsed from their cgroups.

  • For now, the metrics are just total bytes read from/written to disk, plus IO pressure
  • More metrics can easily be added iteratively as desired.

Metrics are currently only shown in the TUI on -vvv verbosity or above.

Example of building the engine (metrics show up in green after exec has been running for at least 3 seconds or has finished):
(asciicast recording)

OTel Metrics Primer

Braindump of what I've learned about OTel metrics:

  1. To publish metrics, you need a Meter, which you get from a MeterProvider (described more below).
    • A Meter is associated with a particular Resource and lets you create all sorts of different Instrument kinds for recording different types of data (link to different kinds)
    • Currently, I only use an Int64Gauge since it fits the collected data so far, but we'll likely need to use more in time.
    • Usage in code
  2. Meters have an associated Aggregation and Temporality that control whether/how data points are aggregated/summarized before continuing down the pipeline.
  3. On the other side of the pipeline there's an Exporter, which works the same as in the rest of OTel: it's an interface to a destination you can push metrics to (OTLP, etc.)
  4. There are two components that connect a Meter and an Exporter: a Reader and a MeterProvider.
    • A MeterProvider is created directly from a Reader and, as we currently use it, is mostly boilerplate for adapting the two interfaces. It lets you create Meters that are hooked up to the Reader.
    • One thing I haven't used yet is Views, which get associated with the MeterProvider and let you customize the metrics stream produced by Meters/Instruments created from that MeterProvider.
    • The Reader is somewhat more meaningful. There are two implementations of it:
      • ManualReader: metrics will be collected and returned to the user by calling the Collect method on the Reader. The user then needs to publish them to an Exporter (or whatever)
      • PeriodicReader: metrics will be automatically collected on an interval and published to a given Exporter in the background
    • For now, I've just used a PeriodicReader everywhere applicable since it's simplest. Wouldn't be surprised if we want a ManualReader with custom logic in the future (e.g. to optimize by exporting based on both time and buffer size); a minimal wiring sketch follows after this list.
    • Usage in code:
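
For concreteness, here is a minimal sketch of that pipeline using the upstream Go SDK, not the PR's actual code: a stdout Exporter, a PeriodicReader on a 3-second interval, a MeterProvider, and an Int64Gauge Instrument. The meter, metric, and attribute names are placeholders, and Int64Gauge assumes a reasonably recent otel-go release.

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Exporter: the destination the metrics get pushed to (OTLP, stdout, etc.).
	exp, err := stdoutmetric.New()
	if err != nil {
		panic(err)
	}

	// PeriodicReader: collects in the background and pushes to the Exporter
	// on an interval.
	reader := sdkmetric.NewPeriodicReader(exp, sdkmetric.WithInterval(3*time.Second))

	// MeterProvider: adapts the Reader and hands out Meters; Views and a
	// Resource would also be attached here.
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader))
	defer provider.Shutdown(ctx)

	meter := provider.Meter("example/exec-metrics")

	// Int64Gauge Instrument recording the latest observed value.
	gauge, err := meter.Int64Gauge("example.disk.read_bytes", metric.WithUnit("byte"))
	if err != nil {
		panic(err)
	}

	// Record a data point, tagged with an attribute for correlation.
	gauge.Record(ctx, 4096, metric.WithAttributes(
		attribute.String("example.span_id", "0123456789abcdef"),
	))
}
```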

Useful links:

Cgroup based metrics

I ended up going with our own custom implementation of cgroup monitoring rather than re-using upstream's code, which let us simplify it and specialize it for the OTel metrics integration.

Monitoring the cgroup amounts to periodically parsing some files under /sys/fs/cgroup for the exec and publishing the parsed values on OTel metric Instruments (a rough sketch follows below).

  • The format of the files being parsed can be found in kernel docs here
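
As a rough illustration of the parsing (not the PR's implementation), summing read/write bytes from a cgroup v2 io.stat file looks roughly like this; the cgroup path and function name are hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readIOStat sums rbytes/wbytes across all devices in a cgroup v2 io.stat
// file, where each line looks like:
//   8:0 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
func readIOStat(cgroupPath string) (readBytes, writeBytes int64, err error) {
	data, err := os.ReadFile(filepath.Join(cgroupPath, "io.stat"))
	if err != nil {
		return 0, 0, err
	}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue // skip blank or malformed lines
		}
		// fields[0] is the MAJ:MIN device ID; the rest are key=value pairs.
		for _, field := range fields[1:] {
			key, val, ok := strings.Cut(field, "=")
			if !ok {
				continue
			}
			n, perr := strconv.ParseInt(val, 10, 64)
			if perr != nil {
				continue
			}
			switch key {
			case "rbytes":
				readBytes += n
			case "wbytes":
				writeBytes += n
			}
		}
	}
	return readBytes, writeBytes, nil
}

func main() {
	// Hypothetical cgroup path for an exec's container.
	r, w, err := readIOStat("/sys/fs/cgroup/some-exec")
	if err != nil {
		panic(err)
	}
	fmt.Printf("disk read=%d bytes, written=%d bytes\n", r, w)
}
```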

I started with Disk bytes read/written and IO pressure because:

  • They are simple Int64Gauges that don't require deltas, aggregation, etc.
  • They are most immediately relevant to bottlenecks we are currently investigating

Adding support for more exec metrics will mostly be a matter of parsing more files.

  • One notable exception is network usage, which instead needs to be sampled from the CNI netns interface, but otherwise it should be the same idea.

Misc notes:

  • The metrics are currently sampled (see the sketch after this list):
    • Every 3 seconds while the exec is running
    • A single time after the exec exits to get final values (or the only values if the exec lasted less than 3s)
  • I found that you will often get inconsistent or unexpectedly low disk read/write byte counts if the exec doesn't include a sync and doesn't use direct IO for reading.
    • This is because the counters only include reads/writes that actually hit the disk, not ones served by the kernel's in-memory read/write cache.
    • This is fine, but cgroups do appear to also give info on read/write caches, so we could consider aggregating if needed
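
A minimal sketch of that sampling cadence, with hypothetical names and signature rather than the PR's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sampleLoop samples the exec's cgroup on a fixed interval while it runs,
// then takes one final sample when the exec exits so short-lived execs and
// final totals are still captured.
func sampleLoop(ctx context.Context, exited <-chan struct{}, collect func()) {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			collect()
		case <-exited:
			collect() // final sample, taken before the cgroup is removed
			return
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	exited := make(chan struct{})
	go func() { time.Sleep(7 * time.Second); close(exited) }()
	sampleLoop(context.Background(), exited, func() { fmt.Println("sampled") })
}
```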

Basically working, just doing cleanup

TODO (mostly notes for self):

  • Try using span id instead of call digest (keep call digest as attr though)
  • Add tests
  • Fix plain progress
  • Fix forwarding from nested sessions (so we can get data from dagger call test)
  • Add io pressure
  • Doc a bit more
  • Remove extraneous // TODO:s
  • Cleanup commit history
  • Hide metrics behind some verbosity level (TBD which)
    • this happened as a side effect of displaying metrics alongside spans rather than calls; they only show up at -vvv now, which seems like a reasonable starting point
  • Check performance impact on CI, tune various intervals if needed
  • Double check whether any duplicate code ended up in transform.go (there was a lot of copypasta from internal otel packages, so something existing may have been duplicated by accident)
  • Fix flaky test (or remove for now pending Better errors, cached states, pending states, duration accounting, fewer spans #8442)

@@ -1124,3 +1144,112 @@ func (w *Worker) installCACerts(ctx context.Context, state *execState) error {

return nil
}

func (w *Worker) runContainer(ctx context.Context, state *execState) (rerr error) {
sipsma (Contributor Author) commented:

Note to reviewers: I refactored this step here because the cgroup monitoring needs close synchronization with the actual execution of the container process, so that it can collect a final cgroup sample before runc.Delete is called (which removes the cgroup). It seemed worth encapsulating this complexity in a setup func like we do for the rest of the executor.
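
A sketch of the ordering constraint being described, with hypothetical function names rather than the actual executor code:

```go
// runAndMonitor illustrates the required ordering: wait for the container
// process to exit, take the last cgroup sample while the cgroup still
// exists, and only then delete the container (which removes the cgroup).
func runAndMonitor(run func() error, finalSample func(), deleteContainer func() error) error {
	runErr := run()   // container process has fully exited here
	finalSample()     // cgroup files are still present at this point
	if err := deleteContainer(); err != nil { // e.g. runc delete
		return err
	}
	return runErr
}
```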

-- ) RETURNING id;
-- name: InsertMetric :one
INSERT INTO metrics (
data
sipsma (Contributor Author) commented:
@vito I saw that spans and logs kept individual rows for a lot of the data and was going to do the same for metrics, but there was some annoyance around writing the ser/deser code for protobuf enums, so I just took the easy route and serialized the whole protobuf object with protojson.

It seems like we could probably get away with the same thing for spans+logs too, which might simplify some of this code a bit. I'm happy to leave everything as is since it all works, just checking with you whether there was some longer-term plan around having these individual fields in sqlite vs. just persisting the whole protobuf obj as a blob.
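
For reference, the protojson approach described above amounts to something like this sketch; the proto type and helper name are illustrative, not the PR's code:

```go
import (
	metricspb "go.opentelemetry.io/proto/otlp/metrics/v1"
	"google.golang.org/protobuf/encoding/protojson"
)

// marshalMetrics serializes the whole OTLP metrics protobuf as JSON so it
// can be stored in a single data column, avoiding per-field columns and the
// enum ser/deser that would come with them.
func marshalMetrics(rm *metricspb.ResourceMetrics) ([]byte, error) {
	return protojson.Marshal(rm)
}
```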

@vito (Contributor) replied:

That works - let's see how it goes. I would leave the rest as-is for now, since we've tossed around ideas for exposing OTel data through the Dagger API somehow, and it'll be easier to do things like "subscribe to this span" with a proper schema.

@sipsma force-pushed the perfmon branch 2 times, most recently from 9cf2e8b to ac3e333 on October 9, 2024 21:47
@@ -42,3 +43,27 @@ func (TelemetrySuite) TestInternalVertexes(ctx context.Context, t *testctx.T) {
require.NotContains(t, logs.String(), "merge (")
})
}

func (TelemetrySuite) TestMetrics(ctx context.Context, t *testctx.T) {
sipsma (Contributor Author) commented:

This test is flaky in CI, I believe just because all the metric plumbing is inherently time-based and thus sometimes slower in some test runs.

Best option is probably to rebase on #8442 and use the new telemetry test setup there.

@sipsma force-pushed the perfmon branch 3 times, most recently from 7c0e94c to 9da716e on October 16, 2024 00:23
@sipsma marked this pull request as ready for review on October 16, 2024 00:26
@sipsma requested a review from a team as a code owner on October 16, 2024 00:26
@sipsma requested review from jedevc and vito on October 16, 2024 00:26
@sipsma force-pushed the perfmon branch 2 times, most recently from 3d70079 to bd6c4ba on October 16, 2024 00:45
@vito (Contributor) left a comment:

Hrm I can't seem to get any readings out of it with this:

dagger-dev core container from --address busybox with-exec --args dd,if=/dev/random,of=foo,bs=1M,count=1234 with-exec --args sleep,10 stdout -vvv

Output:

✔ Container.withExec(args: ["dd", "if=/dev/random", "of=foo", "bs=1M", "count=1234"]): Container! 4.1s
┃ 1234+0 records in
┃ 1234+0 records out
┃ 1293942784 bytes (1.2GB) copied, 3.565858 seconds, 346.1MB/s
  ✔ load cache: creating dagger metadata 0.0s
  ✔ exec dd if=/dev/random of=foo bs=1M count=1234 4.1s
✔ Container.withExec(args: ["sleep", "10"]): Container! 14.2s
  ✔ exec sleep 10 10.1s
✔ Container.stdout: String! 14.2s

Approving anyway on the assumption I missed something.

}
if spans, logs, ok := enginetel.ConfiguredCloudExporters(ctx); ok {
telemetryCfg.LiveTraceExporters = append(telemetryCfg.LiveTraceExporters, spans)
telemetryCfg.LiveLogExporters = append(telemetryCfg.LiveLogExporters, logs)
// TODO: metrics to cloud
Contributor:

👍 - yep, need to get a schema and API set up on that side first

Comment on lines 87 to 91
// OTel metric attribute so we can correlate metrics with spans
MetricsSpanID = "dagger.io/metrics.span"

// OTel metric attribute so we can correlate metrics with traces
MetricsTraceID = "dagger.io/metrics.trace"
Contributor:

nit, for consistency:

Suggested change, from:

// OTel metric attribute so we can correlate metrics with spans
MetricsSpanID = "dagger.io/metrics.span"
// OTel metric attribute so we can correlate metrics with traces
MetricsTraceID = "dagger.io/metrics.trace"

to:

// OTel metric attribute so we can correlate metrics with spans
MetricsSpanIDAttr = "dagger.io/metrics.span"
// OTel metric attribute so we can correlate metrics with traces
MetricsTraceIDAttr = "dagger.io/metrics.trace"

The main benefit of this is to free up the un-Attr-suffixed name for a helper that sets the attribute, e.g. telemetry.MetricsTraceID(traceID)
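
A sketch of the kind of helper being suggested, assuming the *Attr constants from the suggested change above (hypothetical, not code from the PR):

```go
import "go.opentelemetry.io/otel/attribute"

// MetricsTraceID builds the trace-correlation attribute, using the *Attr
// constant (from the suggested change above) as the key.
func MetricsTraceID(traceID string) attribute.KeyValue {
	return attribute.String(MetricsTraceIDAttr, traceID)
}
```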

Comment on lines 93 to 109
// OTel metric for number of bytes read from disk by a container, as parsed from its cgroup
IOStatDiskReadBytes = "dagger.io/metrics.iostat.disk.readbytes"

// OTel metric for number of bytes written to disk by a container, as parsed from its cgroup
IOStatDiskWriteBytes = "dagger.io/metrics.iostat.disk.writebytes"

// OTel metric for number of microseconds SOME tasks in a cgroup were stalled on IO
IOStatPressureSomeTotal = "dagger.io/metrics.iostat.pressure.some.total"

// OTel metric units should be in UCUM format
// https://unitsofmeasure.org/ucum

// Bytes unit for OTel metrics
ByteUnitName = "byte"

// Microseconds unit for OTel metrics
MicrosecondUnitName = "us"
Contributor:

mega nit: these aren't attributes, so it's a little jarring seeing them mixed in. If it's convenient to have them in the same file, we could rename this file to consts.go and group them separately? Or maybe a metrics.go would be nice for more targeted file switching? 🤷‍♂️ Just picking nits. Feel free to merge regardless, can tidy up async

Contributor:

🪠

@@ -279,6 +282,7 @@ func (r renderer) renderSpan(
// TODO: when a span has child spans that have progress, do 2-d progress
// fe.renderVertexTasks(out, span, depth)
r.renderDuration(out, span)
r.renderMetrics(out, span)
Contributor:

Will these show up by default for everyone now? Wondering if we should tie it to a verbosity level. 🤔

edit: oh I guess since it's tied to the internal Buildkit spans they already won't see anything until cranking up the verbosity level, so 👍

@sipsma (Contributor Author) commented on Oct 16, 2024:

@vito

The secret incantation is:

dagger-dev core container from --address busybox with-exec --args dd,if=/dev/random,of=foo,bs=1M,count=1234,oflag=direct with-exec --args sleep,10 stdout -vvv

The difference is oflag=direct. I hit this too and believe (like 90% sure) that what's happening is the kernel read/write caches are getting hit instead of the actual disk; oflag=direct enables direct IO so you skip the caches.

  • Also worth noting that if the exec is cached you get no data at all (as opposed to cached metrics from a previous run), so you need a cache bust

Right now the metrics only show up once they are non-zero, so in your case the metrics for that exec were all 0 and thus never showed up.

To avoid making newer clients incompatible w/ older engines, we can just
skip exporting metrics if the engine doesn't know about them.

Signed-off-by: Erik Sipsma <[email protected]>
@sipsma merged commit 1f9d373 into dagger:main on Oct 18, 2024
57 checks passed
@sipsma added this to the v0.13.6 milestone on Oct 18, 2024