0.19.3-v1
Optimized networking for GCP H100 clusters
dstack now automatically sets up GCP A3 Mega instances with GPUDirect-TCPXO-optimized NCCL communication to take advantage of the 1800Gbps maximum network bandwidth. Here are the NCCL test results on an A3 Mega cluster provisioned with dstack:
✗ dstack apply -f examples/misc/a3mega-clusters/nccl-tests.dstack.yml
nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0
                                            out-of-place                     in-place
      size       count   type  redop  root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
       (B)  (elements)                         (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
   8388608      131072  float   none    -1    166.6   50.34   47.19     N/A    164.1   51.11   47.92     N/A
  16777216      262144  float   none    -1    204.6   82.01   76.89     N/A    203.8   82.30   77.16     N/A
  33554432      524288  float   none    -1    284.0  118.17  110.78     N/A    281.7  119.12  111.67     N/A
  67108864     1048576  float   none    -1    447.4  150.00  140.62     N/A    443.5  151.31  141.86     N/A
 134217728     2097152  float   none    -1    808.3  166.05  155.67     N/A    801.9  167.38  156.92     N/A
 268435456     4194304  float   none    -1   1522.1  176.36  165.34     N/A   1518.7  176.76  165.71     N/A
 536870912     8388608  float   none    -1   2892.3  185.62  174.02     N/A   2894.4  185.49  173.89     N/A
1073741824    16777216  float   none    -1   5532.7  194.07  181.94     N/A   5530.7  194.14  182.01     N/A
2147483648    33554432  float   none    -1    10863  197.69  185.34     N/A    10837  198.17  185.78     N/A
4294967296    67108864  float   none    -1    21481  199.94  187.45     N/A    21466  200.08  187.58     N/A
8589934592   134217728  float   none    -1    42713  201.11  188.54     N/A    42701  201.16  188.59     N/A
Out of bounds values : 0 OK
Avg bus bandwidth : 146.948
Done
For more information on how to provision and use A3 Mega clusters with GPUDirect-TCPXO, see the A3 Mega example.
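As a quick sanity check on the output above, the reported "Avg bus bandwidth" can be reproduced from the table. The sketch below assumes the figure is the plain mean of the busbw column over all message sizes, out-of-place and in-place together; the source does not state this, but the numbers agree with it. The variable names are illustrative only.

```python
# Bus bandwidth values (GB/s) copied from the busbw columns of the table above.
out_of_place_busbw = [47.19, 76.89, 110.78, 140.62, 155.67, 165.34,
                      174.02, 181.94, 185.34, 187.45, 188.54]
in_place_busbw = [47.92, 77.16, 111.67, 141.86, 156.92, 165.71,
                  173.89, 182.01, 185.78, 187.58, 188.59]

# Assumption: the reported average is the mean over both columns and all sizes.
all_busbw = out_of_place_busbw + in_place_busbw
avg_busbw = sum(all_busbw) / len(all_busbw)
peak_busbw = max(all_busbw)

print(f"avg busbw:  {avg_busbw:.3f} GB/s")
print(f"peak busbw: {peak_busbw:.2f} GB/s")
```

The computed mean matches the reported 146.948 to within rounding of the printed values. The average sits well below the peak of 188.59 GB/s because it includes the smaller message sizes, where per-message latency limits throughput.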
H200 and B200 support on DataCrunch
You can now provision H200 and B200 instances on DataCrunch. DataCrunch is the first dstack backend to support B200:
✗ dstack apply --gpu B200
Project main
User admin
Configuration .dstack.yml
Type dev-environment
Resources 1..xCPU, 2GB.., 1xB200, 100GB.. (disk)
Max price -
Max duration -
Inactivity duration -
Spot policy auto
Retry policy -
Creation policy reuse-or-create
Idle duration 5m
Reservation -
#  BACKEND     REGION  INSTANCE   RESOURCES                                      SPOT  PRICE
1  datacrunch  FIN-03  1B200.31V  31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk)  yes   $1.3
2  datacrunch  FIN-03  1B200.31V  31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk)  no    $4.49
3  datacrunch  FIN-01  1B200.31V  31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk)  yes   $1.3   not available
...
Shown 3 of 8 offers, $4.49 max
Submit a new run? [y/n]:
CUDO improvements
The CUDO backend has been updated to support H100, A100, A40, and all other GPUs currently offered by CUDO.
fleets configuration property
With the new fleets property and the --fleet option of dstack apply, you can now restrict the set of fleets considered for reuse:
type: task
fleets: [my-fleet-1, my-fleet-2]
or
dstack apply --fleet my-fleet-1 --fleet my-fleet-2
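Conceptually, the restriction acts as a filter over the fleets a run could otherwise reuse. The sketch below only illustrates that idea. It is not dstack's implementation, and the function name, its parameters, and the "no restriction when the list is empty" behavior are all assumptions made for the example.

```python
from typing import Iterable, List, Optional


def fleets_considered_for_reuse(
    existing_fleets: Iterable[str],
    allowed: Optional[Iterable[str]] = None,
) -> List[str]:
    """Hypothetical helper: keep only the fleets named in `allowed`.

    Assumption: when `allowed` is empty or None, no restriction applies.
    """
    fleets = list(existing_fleets)
    if not allowed:
        return fleets
    allowed_set = set(allowed)
    # Preserve the original ordering of the existing fleets.
    return [name for name in fleets if name in allowed_set]


candidates = fleets_considered_for_reuse(
    ["my-fleet-1", "other-fleet", "my-fleet-2"],
    allowed=["my-fleet-1", "my-fleet-2"],
)
print(candidates)
```

With the configuration shown above, only my-fleet-1 and my-fleet-2 would remain as reuse candidates in this model.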
What's Changed
- [Blog] Built-in UI for monitoring basic GPU metrics by @peterschmidt85 in dstackai/dstack#2470
- Fix Nebius project discovery by @jvstme in dstackai/dstack#2473
- Support A3 Mega GCP clusters with GPUDirect-TCPXO by @r4victor in dstackai/dstack#2469
- Fix Nebius private networks with non-default CIDR by @jvstme in dstackai/dstack#2475
- Add region for Lambda by @HSaddiq in dstackai/dstack#2471
- Fix relative date in CLI for weeks and months by @jvstme in dstackai/dstack#2481
- Fix terminating TensorDock instances by @jvstme in dstackai/dstack#2480
- Use all Lambda regions by default by @jvstme in dstackai/dstack#2478
- Allow mounting volumes into /workflow by @r4victor in dstackai/dstack#2483
- Improve Datacrunch backend by @r4victor in dstackai/dstack#2487
- UI improvements by @olgenn in dstackai/dstack#2489
- Add fleets property to run configurations and CLI by @un-def in dstackai/dstack#2488
- Fix GitIgnore by @un-def in dstackai/dstack#2491
- Remove hardcoded cudo regions by @r4victor in dstackai/dstack#2493
- Optimize GCP list usable subnets across regions by @r4victor in dstackai/dstack#2494
- Make regions filtering case insensitive by @r4victor in dstackai/dstack#2499
New Contributors
- @HSaddiq made their first contribution in dstackai/dstack#2471
Full Changelog: dstackai/dstack@0.19.2...0.19.3