Commit 05ce287

Created Hotswap Best Practice file. (#1041)
* Update README.md * Create hotswap.md
1 parent 882ffc7 commit 05ce287

File tree

2 files changed: +146 −0 lines changed

best-practices/README.md (+4)

@@ -11,3 +11,7 @@ This reference architecture is designed to assist platform administrators, cloud

## [Best Practices for Faster Workload Cold Start](/best-practices/startup-latency.md)

To enhance cold start performance of workloads on Google Kubernetes Engine (GKE), this document provides best practices and examines the elements that influence startup latency.

## [Running a hero training job with hotswap](/best-practices/hotswap.md)

When running large-scale training jobs, interruptions are inevitable. It is critical to set up your training job to be resilient to interruptions in order to achieve high [goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity). Hotswap is one recommended way to improve workload recovery time by leveraging spare capacity.

best-practices/hotswap.md (+142)

@@ -0,0 +1,142 @@
## Use hotswap in your workload

This doc describes how to set up your training job to improve workload recovery time by using hotswap on Google Kubernetes Engine (GKE).

## Introduction

Hotswap reduces Mean Time To Recovery (MTTR) by reacting to infrastructure failures and interruptions and placing the workload onto healthy resources. Without it, workload recovery is gated by infrastructure repair time, which can take up to 10 minutes depending on the hardware platform. Hotswap can reduce this time to about 1 minute, improving the overall training job [goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity).
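To see why shaving recovery time matters for goodput, here is a rough, illustrative calculation. The run length and interruption count below are assumptions made for the sake of the arithmetic, not measured values:

```python
# Illustrative goodput estimate: fraction of wall-clock time spent training.
# The 30-day run and 100 interruptions are hypothetical numbers.
RUN_MINUTES = 30 * 24 * 60   # 43,200 minutes of wall-clock time
INTERRUPTIONS = 100          # assumed interruption count for the run

def goodput(recovery_minutes: float) -> float:
    """Productive time divided by total time, given per-interruption recovery cost."""
    lost = INTERRUPTIONS * recovery_minutes
    return (RUN_MINUTES - lost) / RUN_MINUTES

print(f"~10 min repair wait: {goodput(10):.3f}")  # waiting for infrastructure repair
print(f"~1 min hotswap:      {goodput(1):.3f}")   # rescheduling onto spare capacity
```

Under these assumptions, cutting per-interruption recovery from roughly 10 minutes to 1 minute moves goodput from about 97.7% to about 99.8%.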
6+
7+
## Hotswap Takes Effect
8+
Hotswap takes effect in 2 main ways:
9+
1. When the nodes hosting workloads become unhealthy, the job will be rescheduled onto eligible spare nodes upon interruption.
10+
2. If your workload is configured with [PriorityClass](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass), the job that is configured with higher priority will preempt the low priority jobs’ capacities in the same cluster upon interruptions.
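These two mechanisms can be sketched with a toy scheduling model. The job and pool names here are hypothetical, and the logic is only an illustration of the behavior, not GKE's actual scheduler:

```python
# Toy model of hotswap: reschedule onto a spare pool if one is healthy,
# otherwise preempt the lowest-priority job occupying a healthy pool.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    name: str
    priority: int              # mirrors the PriorityClass .value field
    pool: Optional[str] = None # None means the job is pending

def reschedule(job: Job, healthy_pools: set[str], running: list[Job]) -> Optional[str]:
    """Place `job` on a healthy pool, preempting a lower-priority job if needed."""
    occupied = {j.pool for j in running if j.pool}
    # 1) Prefer an idle healthy (spare) pool.
    for pool in sorted(healthy_pools - occupied):
        job.pool = pool
        return pool
    # 2) Otherwise evict the lowest-priority job with strictly lower priority.
    victims = [j for j in running if j.pool in healthy_pools and j.priority < job.priority]
    if victims:
        victim = min(victims, key=lambda j: j.priority)
        job.pool, victim.pool = victim.pool, None  # victim goes pending
        return job.pool
    return None                                    # job stays pending

high = Job("high-jax-trillium", 2000000)
low = Job("low-jax-trillium", 1000000, pool="pool-b")

# Case 1: a spare pool is healthy -- the job simply moves there.
print(reschedule(high, {"pool-b", "pool-c"}, [low]))  # -> pool-c

# Case 2: only occupied pools are healthy -- the low priority job is preempted.
high.pool = None
print(reschedule(high, {"pool-b"}, [low]))            # -> pool-b
print(low.pool)                                       # -> None (now pending)
```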
11+
12+
## Example
13+
In this example, we will show how to set up the workload using [Jobset](https://github.com/kubernetes-sigs/jobset) together with PriorityClass to achieve hotswap. Jobset is where a lot of the magic takes place. The training jobs are using multi-host TPU slices and [Maxtext](https://github.com/AI-Hypercomputer/maxtext) framework for illustration.
14+
15+
To begin, let's set up two different Priority Classes to indicate our levels of priority.
16+
```
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-job
value: 1000000
globalDefault: false
description: "This priority class should be used for low priority pods only."
```
```
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-job
value: 2000000
globalDefault: false
description: "This priority class should be used for hero pods only."
```
Now we can create a high priority JobSet workload, making sure to set `priorityClassName` to clearly differentiate the workload's priority. The high priority job is a multi-slice Llama2 7B training job running on two 4x4 [Trillium](https://cloud.google.com/blog/products/compute/trillium-tpu-is-ga) slices.

**Note: The YAMLs below are samples that demonstrate the elements needed for hotswap to take effect.**
```
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-jax-trillium
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: high-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=<UNIQUE RUN NAME>
              - steps=300
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
```
Then we can create a low priority JobSet workload, again setting `priorityClassName`. The low priority job is a single-slice training job running on one 4x4 Trillium slice.
```
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: low-jax-trillium
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: low-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=<UNIQUE RUN NAME>
              - steps=300
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
```
Now that the two JobSet specifications have clearly differentiated priorities, we can deploy them:
```
kubectl apply -f low_prio_job.yaml
kubectl apply -f high_prio_job.yaml
```
If the high priority job is interrupted by an infrastructure failure, JobSet restarts it. The restart preempts the low priority job, so the high priority job can be rescheduled without waiting for the failed infrastructure to recover. This happens in seconds, drastically reducing workload idle time.

If you want to verify that your workload setup works, you can simulate a workload interruption by draining one of the TPU node pools that the high priority job is running on:

```kubectl drain -l cloud.google.com/gke-nodepool=${NODEPOOL_NAME}```

The high priority job is restarted and scheduled onto a healthy node pool. At the same time, the low priority job goes into a failed status and its leader pod stays pending. Then uncordon the nodes to simulate the recovery of the infrastructure, and you will see the low priority job rescheduled back onto the recovered node pool:

```kubectl uncordon -l cloud.google.com/gke-nodepool=${NODEPOOL_NAME}```
