Commit 05ce287

Created Hotswap Best Practice file. (#1041)
* Update README.md * Create hotswap.md
1 parent 882ffc7 commit 05ce287

File tree

2 files changed: +146 −0 lines changed

best-practices/README.md (+4)

@@ -11,3 +11,7 @@ This reference architecture is designed to assist platform administrators, cloud

## [Best Practices for Faster Workload Cold Start](/best-practices/startup-latency.md)

To enhance cold start performance of workloads on Google Kubernetes Engine (GKE), this document provides best practices and examines the elements that influence startup latency.

## [Running a hero training job with hotswap](/best-practices/hotswap.md)

When running large-scale training jobs, interruptions are inevitable. It is critical to set up your training job to be resilient to interruptions in order to achieve high [goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity). Hotswap is one recommended way to improve workload recovery time by leveraging spare capacity.

best-practices/hotswap.md (+142)

@@ -0,0 +1,142 @@
## Use hotswap in your workload

This doc describes how to set up your training job to improve workload recovery time by using hotswap on Google Kubernetes Engine (GKE).

## Introduction

Hotswap reduces Mean Time To Recovery (MTTR) by reacting to infrastructure failures and interruptions and placing the workload onto healthy resources. Without it, workload recovery is gated by infrastructure repair time, which can take up to 10 minutes depending on the hardware platform. Hotswap can reduce this time to about 1 minute, improving the overall training job [goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity).
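To see why shaving recovery time matters for goodput, here is a rough, illustrative calculation. The run length and interruption count below are assumptions made for the sake of the arithmetic, not measured values:

```python
# Illustrative goodput estimate: fraction of wall-clock time spent training.
# The 30-day run and 100 interruptions are hypothetical numbers.
RUN_MINUTES = 30 * 24 * 60   # 43,200 minutes of wall-clock time
INTERRUPTIONS = 100          # assumed interruption count for the run

def goodput(recovery_minutes: float) -> float:
    """Productive time divided by total time, given per-interruption recovery cost."""
    lost = INTERRUPTIONS * recovery_minutes
    return (RUN_MINUTES - lost) / RUN_MINUTES

print(f"~10 min repair wait: {goodput(10):.3f}")  # waiting for infrastructure repair
print(f"~1 min hotswap:      {goodput(1):.3f}")   # rescheduling onto spare capacity
```

Under these assumptions, cutting per-interruption recovery from roughly 10 minutes to 1 minute moves goodput from about 97.7% to about 99.8%.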
6+
7+
## Hotswap Takes Effect
8+
Hotswap takes effect in 2 main ways:
9+
1. When the nodes hosting workloads become unhealthy, the job will be rescheduled onto eligible spare nodes upon interruption.
10+
2. If your workload is configured with [PriorityClass](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass), the job that is configured with higher priority will preempt the low priority jobs’ capacities in the same cluster upon interruptions.
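These two mechanisms can be sketched with a toy scheduling model. The job and pool names here are hypothetical, and the logic is only an illustration of the behavior, not GKE's actual scheduler:

```python
# Toy model of hotswap: reschedule onto a spare pool if one is healthy,
# otherwise preempt the lowest-priority job occupying a healthy pool.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    name: str
    priority: int              # mirrors the PriorityClass .value field
    pool: Optional[str] = None # None means the job is pending

def reschedule(job: Job, healthy_pools: set[str], running: list[Job]) -> Optional[str]:
    """Place `job` on a healthy pool, preempting a lower-priority job if needed."""
    occupied = {j.pool for j in running if j.pool}
    # 1) Prefer an idle healthy (spare) pool.
    for pool in sorted(healthy_pools - occupied):
        job.pool = pool
        return pool
    # 2) Otherwise evict the lowest-priority job with strictly lower priority.
    victims = [j for j in running if j.pool in healthy_pools and j.priority < job.priority]
    if victims:
        victim = min(victims, key=lambda j: j.priority)
        job.pool, victim.pool = victim.pool, None  # victim goes pending
        return job.pool
    return None                                    # job stays pending

high = Job("high-jax-trillium", 2000000)
low = Job("low-jax-trillium", 1000000, pool="pool-b")

# Case 1: a spare pool is healthy -- the job simply moves there.
print(reschedule(high, {"pool-b", "pool-c"}, [low]))  # -> pool-c

# Case 2: only occupied pools are healthy -- the low priority job is preempted.
high.pool = None
print(reschedule(high, {"pool-b"}, [low]))            # -> pool-b
print(low.pool)                                       # -> None (now pending)
```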
11+
12+
## Example
13+
In this example, we will show how to set up the workload using [Jobset](https://github.com/kubernetes-sigs/jobset) together with PriorityClass to achieve hotswap. Jobset is where a lot of the magic takes place. The training jobs are using multi-host TPU slices and [Maxtext](https://github.com/AI-Hypercomputer/maxtext) framework for illustration.
14+
15+
To begin, let's set up two different Priority Classes to indicate our levels of priority.
16+
```
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-job
value: 1000000
globalDefault: false
description: "This priority class should be used for low priority pods only."
```
```
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-job
value: 2000000
globalDefault: false
description: "This priority class should be used for hero pods only."
```
Now we can create a high priority JobSet workload, making sure to set `priorityClassName` to clearly differentiate the workload's priority. The high priority job is a multi-slice Llama2 7B training job running on two 4x4 [Trillium](https://cloud.google.com/blog/products/compute/trillium-tpu-is-ga) slices.

**Note: The YAMLs below are samples that demonstrate the elements needed for hotswap to take effect.**
```
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-jax-trillium
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: high-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=<UNIQUE RUN NAME>
              - steps=300
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
```
Then we can create a low priority JobSet workload, again setting `priorityClassName`. The low priority job is a single-slice training job running on one 4x4 Trillium slice.
```
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: low-jax-trillium
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: low-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=<UNIQUE RUN NAME>
              - steps=300
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
```
Now that the two JobSet specifications have clearly differentiated priorities, we can deploy them:
```
kubectl apply -f low_prio_job.yaml
kubectl apply -f high_prio_job.yaml
```
If the high priority job is interrupted by an infrastructure failure, JobSet restarts it. The restart preempts the low priority job, so the high priority job can be rescheduled without waiting for the failed infrastructure to recover. This happens in seconds, drastically reducing workload idle time.

If you want to verify that your workload setup works, you can simulate a workload interruption by draining one of the TPU node pools that the high priority job is running on:

```kubectl drain -l cloud.google.com/gke-nodepool=${NODEPOOL_NAME}```

The high priority job is restarted and scheduled onto a healthy node pool. At the same time, the low priority job goes into a failed status and its leader pod stays pending. Then uncordon the nodes to simulate the recovery of the infrastructure, and you will see the low priority job rescheduled back onto the recovered node pool:

```kubectl uncordon -l cloud.google.com/gke-nodepool=${NODEPOOL_NAME}```
