Support A3 High/Edge GCP clusters with GPUDirect-TCPX #2549
Closes #2532
Follows #2469
This PR adds support for GPUDirect-TCPX optimized networking on the GCP instance types a3-edgegpu-8g and a3-highgpu-8g, as described in https://cloud.google.com/compute/docs/gpus/gpudirect.
Unlike GPUDirect-TCPXO (#2469), which uses a custom Debian-based image, GCP does not provide instructions for building your own GPUDirect-TCPX image from a public image, so this PR uses the COS image, as recommended by the guide. The cluster-toolkit guide does have scripts to build an image for GPUDirect-TCPX, but they require access to a private image:
I contacted GCP support to request access and am still waiting. It would be preferable to build our own image for GPUDirect-TCPX so that we can use nvidia-container-toolkit (unavailable in COS), simplify the implementation, and improve provisioning time.
Implementation notes:
--gpu all with explicit device mapping.
TODO:
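As an illustration of the explicit device mapping mentioned in the implementation notes, the following is a minimal sketch, not the code from this PR. It assumes the GPUs are exposed on the host as /dev/nvidia0 through /dev/nvidia7 alongside /dev/nvidia-uvm and /dev/nvidiactl, and that they are passed to the container with Docker's `--device` flag. The helper name `explicit_gpu_device_args` is hypothetical.

```python
from typing import List


def explicit_gpu_device_args(gpu_count: int = 8) -> List[str]:
    """Build `docker run` --device flags that map each NVIDIA device node explicitly.

    Assumption: the host exposes /dev/nvidia0..N-1 plus the two control nodes
    /dev/nvidia-uvm and /dev/nvidiactl. The actual node list on a given image
    may differ.
    """
    devices = [f"/dev/nvidia{i}" for i in range(gpu_count)]
    devices += ["/dev/nvidia-uvm", "/dev/nvidiactl"]

    args: List[str] = []
    for dev in devices:
        # Map the host device node to the same path inside the container.
        args += ["--device", f"{dev}:{dev}"]
    return args


if __name__ == "__main__":
    # Print the flags for an 8-GPU host as a single command-line fragment.
    print(" ".join(explicit_gpu_device_args()))
```

Without nvidia-container-toolkit, the driver libraries would also have to be made available inside the container by other means (for example, a volume mount); that part is outside this sketch.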