Support A3 High/Edge GCP clusters with GPUDirect-TCPX #2549

r4victor · 2025-04-23T08:21:40Z

Closes #2532
Follows #2469

This PR adds support for GPUDirect-TCPX optimized networking on the GCP instance types a3-edgegpu-8g and a3-highgpu-8g, as described in https://cloud.google.com/compute/docs/gpus/gpudirect.

Unlike GPUDirect-TCPXO (#2469), which uses a custom Debian-based image, GPUDirect-TCPX has no GCP instructions for building your own image from a public image, so this PR uses the COS image as recommended by the guide. The cluster-toolkit guide does have scripts to build an image for GPUDirect-TCPX, but they require access to a private image:

"Before beginning, submit a request to your Google Cloud representative for access to the Deep Learning VM Image for a3-highgpu-8g. It is currently available only by Private Preview request."

I contacted GCP support to request access and am still waiting for a response. It would be preferable to build our own image for GPUDirect-TCPX so that we can use nvidia-container-toolkit (unavailable in COS), simplify the implementation, and improve provisioning time.

Implementation notes:

Uses the COS image instead of the dstack image for a3-edgegpu-8g and a3-highgpu-8g.
As COS is incompatible with nvidia-container-toolkit, extends the shim API to accept gpu_devices in the task config. This allows the server to override the shim's default --gpus all with an explicit device mapping.
Parametrizes the shim/runner binary paths and the shim working directory path, since the default paths cannot be used on COS (only /etc is usable, as other paths are read-only).
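
To illustrate the gpu_devices override described above, here is a minimal sketch of how a default "all GPUs" request might be replaced by an explicit device mapping. It is not the actual shim code: the `GPUDevice` class, its field names, the `docker_gpu_args` helper, and the treatment of `None` versus an empty list are all assumptions made for illustration. Only the Docker CLI flags `--gpus all` and `--device host:container` are real.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GPUDevice:
    # Hypothetical shape of one gpu_devices entry; field names are assumptions.
    path_on_host: str
    path_in_container: str


def docker_gpu_args(gpu_devices: Optional[List[GPUDevice]]) -> List[str]:
    """Build docker-run style GPU arguments.

    Assumption: None means "no override", so the default --gpus all is used
    (this path relies on nvidia-container-toolkit). A list means the caller
    supplied an explicit mapping, so each device is passed through with
    --device instead, which does not need nvidia-container-toolkit.
    """
    if gpu_devices is None:
        return ["--gpus", "all"]
    args: List[str] = []
    for device in gpu_devices:
        args += ["--device", f"{device.path_on_host}:{device.path_in_container}"]
    return args


# Example: map two GPU device files explicitly, as might be needed on COS.
cos_devices = [
    GPUDevice("/dev/nvidia0", "/dev/nvidia0"),
    GPUDevice("/dev/nvidia1", "/dev/nvidia1"),
]
print(docker_gpu_args(None))
print(docker_gpu_args(cos_devices))
```

In this sketch the server decides per instance type whether to send a device list, and the shim side stays generic: it only distinguishes between "no override" and "explicit mapping".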

TODO:

Add a GPUDirect-TCPX example.
Test a3-edgegpu-8g. Only a3-highgpu-8g has been tested so far, since a3-edgegpu-8g is unavailable.

r4victor · 2025-04-24T06:10:37Z

Also, I forgot to mention that DCGM metrics are not yet supported for a3-edgegpu-8g and a3-highgpu-8g.

r4victor added 3 commits April 22, 2025 11:48

Support gpu_devices in task config

5c983fd

Implement tcpx prototype

5470967

Parametrize shim and runner host paths

5888283

r4victor requested a review from un-def April 23, 2025 08:24

un-def approved these changes Apr 23, 2025

r4victor merged commit 16ddda8 into master Apr 23, 2025
24 checks passed

r4victor deleted the issue_2532_tcpx branch April 23, 2025 08:39

r4victor mentioned this pull request Apr 30, 2025

[Bug]: Runs with volumes fail on a3-highgpu-8 #2583

Closed
