[EBPF] gpu: handle runtime changes of CUDA_VISIBLE_DEVICES #38312
Pull Request Overview

This PR adds support for capturing and handling runtime updates to the CUDA_VISIBLE_DEVICES environment variable by introducing a new eBPF uprobe, event type, and propagation through the Go consumer and context.

- Defines a new VisibleDevicesSet event in eBPF C and Go types, tests, and string mappings
- Implements a setenv uprobe to detect changes to CUDA_VISIBLE_DEVICES at runtime
- Propagates and invalidates updated env var values in systemContext and the GPU event consumer
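The propagation step can be pictured with a hedged Go sketch. The type and field names below are illustrative stand-ins, not the agent's actual API; the point is only that a runtime event must both record the new value and invalidate any cached parse for that process.

```go
package main

import "fmt"

// visibleDevicesSetEvent is a hypothetical Go mirror of the new eBPF
// event: the PID whose environment changed and the new value of
// CUDA_VISIBLE_DEVICES. The agent's real types differ.
type visibleDevicesSetEvent struct {
	PID   uint32
	Value string
}

// systemContextSketch stands in for the systemContext that caches
// per-process environment lookups.
type systemContextSketch struct {
	// overrides holds runtime-captured values, keyed by PID.
	overrides map[uint32]string
	// cachedDevices is a parse cache that must be invalidated when a
	// process rewrites its environment at runtime.
	cachedDevices map[uint32][]int
}

// handleEvent records the fresh value and drops the stale cached parse
// so the next device lookup re-resolves it.
func (c *systemContextSketch) handleEvent(ev visibleDevicesSetEvent) {
	c.overrides[ev.PID] = ev.Value
	delete(c.cachedDevices, ev.PID)
}

func main() {
	ctx := &systemContextSketch{
		overrides:     map[uint32]string{},
		cachedDevices: map[uint32][]int{1234: {0, 1}},
	}
	ctx.handleEvent(visibleDevicesSetEvent{PID: 1234, Value: "1"})
	fmt.Println(ctx.overrides[1234], len(ctx.cachedDevices)) // 1 0
}
```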
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
pkg/network/usm/sharedlibraries/libset.go | Add libc libset constant and suffix mapping |
pkg/network/ebpf/c/shared-libraries/probes.h | Introduce match4chars macro and capture libc libset events |
pkg/network/ebpf/c/shared-libraries/maps.h | Add libc_shared_libraries perf event array |
pkg/gpu/probe.go | Register uprobe__setenv , update shared libs libsets slice |
pkg/gpu/ebpf/kprobe_types_string_linux.go | Append new event names to _CudaEventType_name and index |
pkg/gpu/ebpf/kprobe_types_linux_test.go | Add alignment test for CudaVisibleDevicesSetEvent |
pkg/gpu/ebpf/kprobe_types_linux.go | Define CudaVisibleDevicesSetEvent , update constants & sizes |
pkg/gpu/ebpf/kprobe_types.go | Declare Go type and constants for new event |
pkg/gpu/ebpf/c/types.h | Define cuda_visible_devices_set_t struct and related macros |
pkg/gpu/ebpf/c/runtime/gpu.c | Implement SEC("uprobe/setenv") to read and emit new event |
pkg/gpu/cuda/env_test.go | Update tests to call ParseVisibleDevices |
pkg/gpu/cuda/env.go | Rename and refactor env var parsing to ParseVisibleDevices |
pkg/gpu/context_test.go | Add tests for runtime-updated visible devices cache invalidation |
pkg/gpu/context.go | Track and invalidate updated env var values in systemContext |
pkg/gpu/consumer.go | Handle VisibleDevicesSet events and update GPU context |
pkg/ebpf/uprobes/attacher_test.go | Extend shared-library tests to include libc libset |
pkg/ebpf/cgo/genpost.go | Include "Devices" field in generated post-processing |
Regression Detector

Regression Detector Results (Metrics dashboard)
Baseline: 636a14c
Optimization Goals: ✅ No significant changes detected

perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | docker_containers_cpu | % cpu utilization | +1.96 | [-1.17, +5.09] | 1 | Logs |
➖ | quality_gate_logs | % cpu utilization | +1.79 | [-0.98, +4.56] | 1 | Logs bounds checks dashboard |
➖ | file_tree | memory utilization | +0.96 | [+0.83, +1.09] | 1 | Logs |
➖ | otlp_ingest_logs | memory utilization | +0.81 | [+0.68, +0.94] | 1 | Logs |
➖ | otlp_ingest_metrics | memory utilization | +0.34 | [+0.20, +0.48] | 1 | Logs |
➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.33 | [+0.28, +0.38] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | +0.33 | [+0.25, +0.41] | 1 | Logs bounds checks dashboard |
➖ | quality_gate_idle | memory utilization | +0.25 | [+0.16, +0.34] | 1 | Logs bounds checks dashboard |
➖ | ddot_logs | memory utilization | +0.15 | [+0.06, +0.24] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | +0.05 | [-0.53, +0.63] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | +0.04 | [-0.59, +0.67] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | +0.03 | [-0.55, +0.61] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.02 | [-0.21, +0.26] | 1 | Logs |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.01 | [-0.59, +0.62] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.02, +0.04] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.26, +0.24] | 1 | Logs |
➖ | docker_containers_memory | memory utilization | -0.01 | [-0.07, +0.05] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.01 | [-0.60, +0.57] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | -0.04 | [-0.62, +0.53] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.11 | [-0.67, +0.45] | 1 | Logs |
➖ | ddot_metrics | memory utilization | -0.18 | [-0.30, -0.06] | 1 | Logs |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.29 | [-1.16, +0.59] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | -1.00 | [-1.09, -0.92] | 1 | Logs |
Bounds Checks: ✅ Passed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
✅ | docker_containers_cpu | simple_check_run | 10/10 | |
✅ | docker_containers_memory | memory_usage | 10/10 | |
✅ | docker_containers_memory | simple_check_run | 10/10 | |
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
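The three criteria combine into a simple decision rule. A hedged Go sketch (function name and parameters are illustrative, not the detector's real implementation), using the report's stated thresholds:

```go
package main

import "fmt"

// isRegression encodes the criteria above: effect size of at least 5%,
// a 90% confidence interval excluding zero, and the experiment not
// marked "erratic" in its configuration.
func isRegression(deltaMeanPct, ciLow, ciHigh float64, erratic bool) bool {
	if erratic {
		return false
	}
	if deltaMeanPct < 0 {
		deltaMeanPct = -deltaMeanPct // compare |Δ mean %|
	}
	if deltaMeanPct < 5.0 {
		return false
	}
	// The CI excludes zero when both bounds share a sign.
	return ciLow > 0 || ciHigh < 0
}

func main() {
	// docker_containers_cpu row above: +1.96 [-1.17, +5.09] -> not a regression.
	fmt.Println(isRegression(1.96, -1.17, 5.09, false)) // false
	fmt.Println(isRegression(6.2, 5.1, 7.3, false))     // true
}
```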
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
Static quality checks

✅ Please find below the results from static quality gates.
Successful checks
eBPF complexity changes

Summary result: ❔ - needs attention

shared_libraries [programs with changes]
shared_libraries [programs without changes]

This report was generated based on the complexity data for the current branch guillermo.julian/setenv-detect (pipeline 70195279, commit edcc32a) and the base branch main (commit 636a14c). Objects without changes are not reported. Contact #ebpf-platform if you have any questions/feedback. Table complexity legend: 🔵 - new; ⚪ - unchanged; 🟢 - reduced; 🔴 - increased
LGTM on the USM owned files
good for docs
few small comments
/merge

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
What does this PR do?

This PR adds support for handling runtime updates for the CUDA_VISIBLE_DEVICES environment variable. This consists of the following changes:

- A new uprobe on the setenv function in libc, which watches exclusively for calls with the CUDA_VISIBLE_DEVICES variable name. The value is sent to the GPU event ringbuffer.
- A new shared-library libset to track processes that load libc.
- The uprobe is attached to all processes that call setenv. This is necessary for all processes, as a process might change its environment variable before calling any CUDA functions; if we probed only processes that have the CUDA library loaded, we would miss those events.
- The captured value is stored in systemContext, which now prioritizes it over the one from /proc/pid/environ when finding the devices visible to a process.
- To unify the handling of CUDA_VISIBLE_DEVICES, the code retrieving the environment variable for the process has been moved from the pkg/gpu/cuda package to systemContext.
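The lookup order described above can be sketched in Go. The override map and function name are hypothetical, but the fallback path reflects the real /proc/<pid>/environ format: NUL-separated KEY=VALUE pairs, per proc(5).

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// visibleDevicesForProcess returns the CUDA_VISIBLE_DEVICES value for a
// process: a value captured at runtime by the setenv uprobe (stored per
// PID in override) takes priority over the startup-time value read from
// /proc/<pid>/environ.
func visibleDevicesForProcess(pid int, override map[int]string) (string, bool) {
	if v, ok := override[pid]; ok {
		return v, true // runtime update wins
	}
	// Fall back to the environment the process started with.
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/environ", pid))
	if err != nil {
		return "", false
	}
	for _, kv := range strings.Split(string(data), "\x00") {
		if v, ok := strings.CutPrefix(kv, "CUDA_VISIBLE_DEVICES="); ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	override := map[int]string{42: "1,3"}
	v, ok := visibleDevicesForProcess(42, override)
	fmt.Println(v, ok) // 1,3 true
}
```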
Motivation

https://datadoghq.atlassian.net/browse/NWL-1275

We have detected an issue in a ray cluster. When deployed in a node with multiple GPUs, ray will spawn one process per GPU. Each process will then update its own CUDA_VISIBLE_DEVICES environment variable. Our current approach will not detect this change, and will consider that all processes are using the same GPU.

Describe how you validated your changes
Unit tests are provided; the change was also tested in the GPU load test environment with no significant changes in resource usage.
Finally, manually tested this in an AWS instance with Ray running on multiple GPUs, replicating the environment with the original error.
Possible Drawbacks / Trade-offs
Additional Notes