Support A3 High/Edge GCP clusters with GPUDirect-TCPX #2549

Merged on Apr 23, 2025 (3 commits)

@r4victor (Collaborator) commented Apr 23, 2025

Closes #2532
Follows #2469

The PR adds support for GPUDirect-TCPX optimized networking on GCP instances a3-edgegpu-8g and a3-highgpu-8g as described in https://cloud.google.com/compute/docs/gpus/gpudirect.

Unlike GPUDirect-TCPXO (#2469), which uses a custom Debian-based image, GCP does not provide instructions for building your own GPUDirect-TCPX image from a public image, so this PR uses the COS image as recommended by the guide. The cluster-toolkit guide does have scripts to build an image for GPUDirect-TCPX, but it requires access to a private image:

Before beginning, submit a request to your Google Cloud representative for access to the Deep Learning VM Image for a3-highgpu-8g. It is currently available only by Private Preview request.

I contacted GCP support to request access and am still waiting. It would be preferable to build our own image for GPUDirect-TCPX so that we can use nvidia-container-toolkit (unavailable in COS), simplify the implementation, and improve provisioning time.

Implementation notes:

  • Uses the COS image instead of the dstack image for a3-edgegpu-8g and a3-highgpu-8g.
  • As COS is incompatible with nvidia-container-toolkit, extends the shim API to accept gpu_devices in the task config. This allows the server to override the shim's default --gpu all with an explicit device mapping.
  • Parametrizes the shim/runner binary paths and the shim working directory path, since it's not possible to use the default paths on COS (only /etc can be used, as other paths are read-only).
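
The gpu_devices override described above can be sketched roughly as follows. This is a minimal illustration, not the actual dstack code: the `GPUDevice` class, both function names, and the `/dev/nvidia*` device node layout are assumptions made for the example, and the default is written with the Docker CLI spelling `--gpus all`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GPUDevice:
    # Hypothetical shape of one entry in the task config's gpu_devices list.
    path_on_host: str
    path_in_container: str


def cos_gpu_devices(gpu_count: int) -> List[GPUDevice]:
    """List the device nodes to map for gpu_count GPUs.

    Assumes the usual NVIDIA layout: one /dev/nvidiaN node per GPU plus
    the shared /dev/nvidia-uvm and /dev/nvidiactl control nodes.
    """
    paths = [f"/dev/nvidia{i}" for i in range(gpu_count)]
    paths += ["/dev/nvidia-uvm", "/dev/nvidiactl"]
    return [GPUDevice(path_on_host=p, path_in_container=p) for p in paths]


def docker_gpu_args(gpu_devices: Optional[List[GPUDevice]]) -> List[str]:
    """Return the GPU-related `docker run` arguments.

    With no explicit mapping, fall back to the default that relies on
    nvidia-container-toolkit; otherwise pass each device node explicitly.
    """
    if gpu_devices is None:
        return ["--gpus", "all"]
    args: List[str] = []
    for device in gpu_devices:
        args += ["--device", f"{device.path_on_host}:{device.path_in_container}"]
    return args


if __name__ == "__main__":
    # An a3-highgpu-8g instance has 8 GPUs.
    print(" ".join(docker_gpu_args(cos_gpu_devices(8))))
```

In this sketch the server decides the mapping and the shim only translates it, so images that do have nvidia-container-toolkit keep the default path unchanged.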

TODO:

  • GPUDirect-TCPX example
  • Test a3-edgegpu-8g. So far tested only a3-highgpu-8g since a3-edgegpu-8g is unavailable.

@r4victor r4victor requested a review from un-def April 23, 2025 08:24
@r4victor r4victor merged commit 16ddda8 into master Apr 23, 2025
24 checks passed
@r4victor r4victor deleted the issue_2532_tcpx branch April 23, 2025 08:39
@r4victor (Collaborator, Author) commented

I also forgot to mention that DCGM metrics are not yet supported for a3-edgegpu-8g and a3-highgpu-8g.

Linked issue: [Feature]: Support A3 High Multi-Node GCP clusters with GPUDirect-TCPX