v1.10

@neerajag007 neerajag007 released this 13 Feb 20:02
· 56 commits to main since this release
d64781d

Release v1.10 - 2024-01-07 Onwards

This release focuses on improving TPU support, enhancing security, and refining benchmarking tools for AI workloads on GKE.

New Features:

  • Added Terraform configurations to create A3U and A4M clusters, along with infrastructure switches for A3U and A4U (#969).
  • Introduced a multi-cluster batch processing platform example using GKE Autopilot, DWS, and Kueue with MultiKueue enabled (#949).
  • Added a guide for running SkyPilot on GKE with Dynamic Workload Scheduling and Kueue (#942).
  • Integrated the Kubernetes Security Validation Service (Shipshape) cluster scan into the project to enhance security validation and compliance (#935).
  • Benchmarking tool now scrapes more vLLM metrics for detailed performance analysis (#937).
  • Benchmarking script's request timeout is now configurable via the --request-timeout flag (#932).
  • Added NCCL test switching logic (#954).
  • Added recipe switching logic (#950).
  • Converted the A3 Mega NCCL test from pods to a JobSet (#961).
  • Added missing GCS module (#959).
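
The configurable request timeout from #932 can be pictured with a minimal sketch like the one below. Only the `--request-timeout` flag name comes from these notes; the parser, the `parse_timeout` helper, the unit (seconds), and the default value are assumptions made for illustration and may not match the benchmarking script itself.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical benchmark runner."""
    parser = argparse.ArgumentParser(description="Hypothetical benchmark runner")
    parser.add_argument(
        "--request-timeout",
        type=float,
        default=300.0,  # assumed default, not taken from the project
        help="Per-request timeout in seconds (assumed unit)",
    )
    return parser


def parse_timeout(argv: list[str]) -> float:
    """Return the validated timeout parsed from argv."""
    args = build_parser().parse_args(argv)
    if args.request_timeout <= 0:
        raise ValueError("--request-timeout must be positive")
    return args.request_timeout
```

Under these assumptions, `parse_timeout(["--request-timeout", "45"])` yields the value to hand to whichever HTTP client issues the benchmark requests.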

Improvements:

  • Updated the default TPU webhook image to v1.2.2, which includes a fix for incorrect TPU_WORKER_HOSTNAMES caused by KubeRay controller truncation (#972).
  • Improved TPU provisioning by adding support for v6e and cross-project reservations (#851).
  • Prefixed GCS bucket names with project_id to avoid naming conflicts (#975).
  • Replaced subnetwork name with vpc (#971).
  • Added a note to use the nightly SkyPilot version when using serve on Autopilot, along with troubleshooting steps (#970).
  • Clarified that container-image should have a tag in disk image building (#939).
  • Made the GCS bucket optional in the benchmarking script (#924).
  • Added "models" to LPG sample.tfvars (#926).
  • Merged Helm scan with cluster scan (#957).
  • Made small fixes and temporarily disabled the UI visibility toggle (#955).
  • Fixed header on variables file (#974).
  • Fixed duplicate deployment_name variable (#944).
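
The reasoning behind the project_id prefix (#975) is that GCS bucket names share one global namespace, so a generic name chosen in one project can collide with a bucket owned by another. The change itself lives in the Terraform configurations; the Python sketch below only illustrates the naming idea. The `prefixed_bucket_name` helper and the `-` separator are hypothetical.

```python
import re

# GCS limits bucket names without dots to 3-63 characters.
MIN_BUCKET_NAME_LENGTH = 3
MAX_BUCKET_NAME_LENGTH = 63

# Lowercase letters, digits, dashes, and underscores; must start and end
# with a letter or digit. Dots are deliberately left out of this sketch.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]*[a-z0-9]")


def prefixed_bucket_name(project_id: str, base_name: str) -> str:
    """Return base_name prefixed with project_id (hypothetical helper)."""
    name = f"{project_id}-{base_name}".lower()
    if not MIN_BUCKET_NAME_LENGTH <= len(name) <= MAX_BUCKET_NAME_LENGTH:
        raise ValueError(f"bucket name has invalid length: {len(name)}")
    if not _BUCKET_NAME_RE.fullmatch(name):
        raise ValueError(f"bucket name has invalid characters: {name!r}")
    return name
```

Because project IDs are themselves globally unique, the prefixed name is far less likely to clash, at the cost of leaving fewer characters for the base name.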

Bug Fixes:

  • Fixed a bug in the TPU webhook where KubeRay's service name truncation resulted in incorrectly generated TPU_WORKER_HOSTNAMES (#963).
  • Addressed CVE-2024-45338 in the TPU webhook image by fixing an upstream golang.org/x/net vulnerability (#968).
  • Handled the bucket not found exception in the benchmarking script (#929).
  • Reverted the TGI image version to address out-of-GPU memory issues on L4 nodes (#931).
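
The TPU_WORKER_HOSTNAMES fix (#963) concerns a class of bug where one component shortens a service name and another builds hostnames from the original name. The sketch below illustrates that mismatch only. The truncation rule (keep the first 63 characters), the `worker-<i>.<service>` hostname shape, and both function names are assumptions for illustration; the actual KubeRay and webhook logic may differ.

```python
MAX_DNS_LABEL_LENGTH = 63  # DNS label length limit


def truncate_label(name: str, limit: int = MAX_DNS_LABEL_LENGTH) -> str:
    """Hypothetical truncation rule: keep the first `limit` characters."""
    return name[:limit]


def worker_hostnames(service_name: str, num_workers: int, truncate: bool) -> str:
    """Build a comma-separated hostname list in an assumed format."""
    svc = truncate_label(service_name) if truncate else service_name
    return ",".join(f"worker-{i}.{svc}" for i in range(num_workers))


long_name = "ray-cluster-" + "x" * 60 + "-headless-svc"
stale = worker_hostnames(long_name, 2, truncate=False)   # ignores truncation
actual = worker_hostnames(long_name, 2, truncate=True)   # matches the service
```

Here `stale` and `actual` differ, so workers given the `stale` value would try to resolve hostnames that do not correspond to the service that exists.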