Release v1.10 - 2024-01-07 Onwards
This release focuses on improving TPU support, enhancing security, and refining benchmarking tools for AI workloads on GKE.
New Features:
- Added Terraform configurations to create A3U and A4M clusters, along with infrastructure switches for A3U and A4U (#969).
- Introduced a multi-cluster batch processing platform example using GKE Autopilot, DWS, and Kueue with MultiKueue enabled (#949).
- Added a guide for running SkyPilot on GKE with Dynamic Workload Scheduling and Kueue (#942).
- Integrated the Kubernetes Security Validation Service (Shipshape) cluster scan into the project to enhance security validation and compliance (#935).
- Benchmarking tool now scrapes more vLLM metrics for detailed performance analysis (#937).
- Benchmarking script's request timeout is now configurable via the --request-timeout flag (#932).
- Added NCCL test switching logic (#954).
- Added recipe switching logic (#950).
- Converted the A3 Mega NCCL test from pods to a JobSet (#961).
- Added missing GCS module (#959).
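
As an illustration of the --request-timeout item above, here is a minimal sketch of how a benchmarking client might expose such a flag with Python's argparse. Only the flag name comes from these notes; the parser, the helper names, the unit (seconds), and the default value are assumptions and may differ from the actual script.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser; only the flag name is taken from the notes."""
    parser = argparse.ArgumentParser(description="Hypothetical benchmark client")
    parser.add_argument(
        "--request-timeout",
        type=float,
        default=30.0,  # assumed default, in seconds; not confirmed by the release notes
        help="Per-request timeout in seconds (assumed unit).",
    )
    return parser


def parse_timeout(argv: Optional[Sequence[str]] = None) -> float:
    """Return the timeout parsed from argv, or the assumed default."""
    # argparse maps --request-timeout to the attribute request_timeout.
    return build_parser().parse_args(argv).request_timeout
```

With this sketch, parse_timeout(["--request-timeout", "120"]) yields 120.0, and omitting the flag falls back to the assumed default.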
Improvements:
- Updated the default TPU webhook image to v1.2.2, which includes a fix for incorrect TPU_WORKER_HOSTNAMES caused by KubeRay controller truncation (#972).
- Improved TPU provisioning by adding support for v6e and cross-project reservations (#851).
- Prefixed GCS bucket names with project_id to avoid conflicts (#975).
- Replaced subnetwork name with vpc (#971).
- Added a note to use the nightly SkyPilot version when using serve on Autopilot, along with troubleshooting steps (#970).
- Clarified that container-image should have a tag in disk image building (#939).
- Made the GCS bucket optional in the benchmarking script (#924).
- Added "models" to LPG sample.tfvars (#926).
- Merged Helm scan with cluster scan (#957).
- Made small fixes and temporarily disabled UI visibility toggle (#955).
- Fixed header on variables file (#974).
- Fixed duplicate deployment_name variable (#944).
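
The project_id prefix item above works because GCS bucket names share a single global namespace, so a common name such as "models" is likely to collide across projects. The actual change lives in the project's Terraform; the Python sketch below only illustrates the naming idea. The function name, the "-" separator, and the 63-character check (the GCS limit for names without dots) are assumptions made for this example.

```python
MAX_BUCKET_NAME_LEN = 63  # GCS limit for bucket names that contain no dots


def prefixed_bucket_name(project_id: str, bucket_name: str) -> str:
    """Return bucket_name prefixed with project_id (hypothetical helper)."""
    prefix = f"{project_id}-"  # assumed separator; the real module may differ
    # Avoid double-prefixing if the caller already passed a prefixed name.
    name = bucket_name if bucket_name.startswith(prefix) else prefix + bucket_name
    if len(name) > MAX_BUCKET_NAME_LEN:
        raise ValueError(f"bucket name too long ({len(name)} characters): {name}")
    return name
```

For example, prefixed_bucket_name("my-proj", "models") gives "my-proj-models", and passing an already prefixed name returns it unchanged.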
Bug Fixes:
- Fixed a bug in the TPU webhook where KubeRay's service name truncation resulted in incorrectly generated TPU_WORKER_HOSTNAMES (#963).
- Addressed CVE-2024-45338 in the TPU webhook image by fixing an upstream golang.org/x/net vulnerability (#968).
- Handled the bucket not found exception in the benchmarking script (#929).
- Reverted the TGI image version to address out-of-GPU memory issues on L4 nodes (#931).