How to adapt the Shared NVSwitch Virtualization Model of FM to activate nvlink in multi-gpu VMs #133

Open
mingyanghaha opened this issue Mar 26, 2025 · 1 comment

Why is this needed?

In virtualized environments on DGX/HGX A100/H100 systems, NVIDIA provides the Shared NVSwitch Virtualization Model to enable NVLink connections for multi-GPU VMs. This model requires that all GPUs assigned to a VM belong to the same partition.

What is the Shared NVSwitch Virtualization Model?

Image

Only GPUs are passed through to the guests.

  1. NVSwitch memory fabrics are managed by a dedicated trusted VM called Service VM.
  2. NVSwitch memory fabrics are shared by the guest VMs, but the fabrics are not visible to guests.
  3. Requires the tightest integration with the hypervisor.
  4. Complete bandwidth for two-GPU and four-GPU VMs.
  5. No need for direct communication between the guest VM and the Service VM.

shared-nvswitch-virtualization-model

Proposal

The GPUs assigned to the VM must belong to the same partition.
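The same-partition rule can be sketched as a simple containment check. This is only an illustration under assumptions: the `Partition` type, the `findCoveringPartition` helper, and the GPU IDs and partition layout below are all hypothetical, and a real plugin would read the partition table from Fabric Manager rather than hard-code it. The check also treats "belongs to the same partition" as "is a subset of one partition"; a stricter exact-match rule may be needed in practice.

```go
package main

import "fmt"

// Partition is a hypothetical, simplified view of a Fabric Manager partition:
// an ID plus the IDs of the GPUs it contains.
type Partition struct {
	ID   int
	GPUs []string
}

// findCoveringPartition returns the first partition (in slice order) that
// contains every requested GPU. It reports false when the request is empty
// or when no single partition covers all requested GPUs.
func findCoveringPartition(partitions []Partition, requested []string) (Partition, bool) {
	if len(requested) == 0 {
		return Partition{}, false
	}
	for _, p := range partitions {
		members := make(map[string]bool, len(p.GPUs))
		for _, g := range p.GPUs {
			members[g] = true
		}
		covered := true
		for _, g := range requested {
			if !members[g] {
				covered = false
				break
			}
		}
		if covered {
			return p, true
		}
	}
	return Partition{}, false
}

func main() {
	// Made-up partition layout, for illustration only.
	partitions := []Partition{
		{ID: 1, GPUs: []string{"gpu0", "gpu1", "gpu2", "gpu3"}},
		{ID: 3, GPUs: []string{"gpu0", "gpu1"}},
		{ID: 4, GPUs: []string{"gpu2", "gpu3"}},
	}
	for _, req := range [][]string{{"gpu2", "gpu3"}, {"gpu1", "gpu4"}} {
		if p, ok := findCoveringPartition(partitions, req); ok {
			fmt.Printf("%v fits in partition %d\n", req, p.ID)
		} else {
			fmt.Printf("no single partition covers %v, reject\n", req)
		}
	}
}
```

A check of this shape is what the Allocate step in the proposal would need to run before handing GPUs to a VM.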

How to assign GPUs that belong to the same partition

  • Implement the GetDevicePluginOptions interface and set
    GetPreferredAllocationAvailable, so that kubelet calls
    GetPreferredAllocation before allocating GPUs.
  • The GetPreferredAllocation interface recommends a set of H100/H800 GPUs
    that fall within a single partition.
  • The Allocate interface verifies whether the GPUs belong to the same
    partition during allocation.
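The selection logic behind GetPreferredAllocation could look roughly like the sketch below. It is not the plugin's implementation: `Partition`, `preferGPUs`, and the partition layout are hypothetical, and the "smallest partition that fits first" ordering is an assumption made here to keep larger partitions free for larger requests. In the kubelet device plugin API (v1beta1), the inputs would come from the `AvailableDeviceIDs`, `MustIncludeDeviceIDs`, and `AllocationSize` fields of `ContainerPreferredAllocationRequest`; plain slices are used instead so the example runs without Kubernetes dependencies.

```go
package main

import (
	"fmt"
	"sort"
)

// Partition is a hypothetical view of a Fabric Manager partition.
type Partition struct {
	ID   int
	GPUs []string
}

// toSet converts a slice of GPU IDs into a lookup set.
func toSet(ids []string) map[string]bool {
	s := make(map[string]bool, len(ids))
	for _, id := range ids {
		s[id] = true
	}
	return s
}

// preferGPUs returns `size` GPU IDs that all belong to one partition, or nil
// if no partition can satisfy the request. Smaller partitions are tried first.
// mustInclude is assumed to be a subset of available.
func preferGPUs(partitions []Partition, available, mustInclude []string, size int) []string {
	if size <= 0 {
		return nil
	}
	avail := toSet(available)
	must := toSet(mustInclude)

	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]Partition(nil), partitions...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return len(sorted[i].GPUs) < len(sorted[j].GPUs)
	})

	for _, p := range sorted {
		members := toSet(p.GPUs)
		ok := true
		for _, g := range mustInclude {
			if !members[g] {
				ok = false
				break
			}
		}
		if !ok {
			continue
		}
		// Start from the mandatory GPUs, then fill up with free partition members.
		picked := append([]string(nil), mustInclude...)
		for _, g := range p.GPUs {
			if len(picked) >= size {
				break
			}
			if avail[g] && !must[g] {
				picked = append(picked, g)
			}
		}
		if len(picked) == size {
			return picked
		}
	}
	return nil
}

func main() {
	// Made-up layout: two 4-GPU partitions, each split into two 2-GPU partitions.
	partitions := []Partition{
		{ID: 1, GPUs: []string{"gpu0", "gpu1", "gpu2", "gpu3"}},
		{ID: 2, GPUs: []string{"gpu4", "gpu5", "gpu6", "gpu7"}},
		{ID: 3, GPUs: []string{"gpu0", "gpu1"}},
		{ID: 4, GPUs: []string{"gpu2", "gpu3"}},
		{ID: 5, GPUs: []string{"gpu4", "gpu5"}},
		{ID: 6, GPUs: []string{"gpu6", "gpu7"}},
	}
	// gpu2 and gpu3 are already in use by another VM.
	free := []string{"gpu0", "gpu1", "gpu4", "gpu5", "gpu6", "gpu7"}
	fmt.Println(preferGPUs(partitions, free, nil, 2))
	fmt.Println(preferGPUs(partitions, free, nil, 4))
}
```

Because kubelet treats the preferred allocation as a hint rather than a guarantee, the Allocate step still has to verify the final GPU set against the partition table.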

The diagram below illustrates the partition tree for H100/H800. If partition 4 has already been allocated, partition 3 will be prioritized for the next allocation.

Image

mingyanghaha commented Mar 31, 2025

@tariq1890 See if this proposal is feasible, and let me know if you have any suggestions. Thanks.
