Commit d64781d

add dws multiclusters example folder (#949)
* add dws multiclusters example folder
* add terraform support
* update README Terraform command
1 parent 48b0571 commit d64781d

13 files changed, 796 additions, 0 deletions
@@ -0,0 +1 @@
*.kubeconfig
@@ -0,0 +1,80 @@
# Multikueue-dws-integration

This repository provides the files needed to demonstrate how to use MultiKueue with Dynamic Workload Scheduler (DWS) on GKE Autopilot. With this setup, you can run workloads across multiple GKE clusters in different regions and let DWS use GPU capacity wherever it is available.

## Repository Contents

This repository contains the following files:

* `create-clusters.sh`: Script that creates the required GKE clusters (one manager and three workers).
* `tf/`: Terraform configuration that creates the required GKE clusters (one manager and three workers). You can use it instead of the bash script.
* `deploy-multikueue.sh`: Script that installs and configures Kueue and MultiKueue on the clusters.
* `dws-multi-worker.yaml`: Kueue configuration for the worker clusters, including manager configuration.
* `job-multi-dws-autopilot.yaml`: Example job definition to submit to the MultiKueue setup.

## Setup and Usage

### Create Clusters

Create the clusters with Terraform (or run `create-clusters.sh` instead):

```
terraform -chdir=tf init
terraform -chdir=tf plan
terraform -chdir=tf apply -var project_id=<YOUR PROJECT ID>
```

### Install Kueue

After creating the GKE clusters and updating your kubeconfig files, install the Kueue components:

```
./deploy-multikueue.sh
```

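Before running the installer, it can help to confirm that the kubectl contexts you expect are present. The sketch below is an illustration only: `missing_contexts` is a hypothetical helper that is not part of this repository, and the context names in the usage comment are the ones `create-clusters.sh` assigns. Clusters created with Terraform may end up with different context names.

```shell
# Print the names (space-separated) of the expected contexts that kubectl
# does not know about. Prints an empty line when nothing is missing.
missing_contexts() {
  local have want missing=""
  have="$(kubectl config get-contexts -o name)"
  for want in "$@"; do
    # -F: fixed string, -x: whole-line match, -q: quiet
    printf '%s\n' "$have" | grep -Fxq "$want" || missing="$missing $want"
  done
  # Drop the leading space before printing.
  printf '%s\n' "${missing# }"
}

# Example (context names assumed from create-clusters.sh):
#   missing_contexts manager-europe-west4 worker-asia-southeast1 \
#     worker-us-east4 worker-eu-west4
```

An empty result means every listed context exists in the active kubeconfig.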
### Validate installation

Verify the Kueue installation and the connection between the manager and worker clusters:

```
kubectl get clusterqueues dws-cluster-queue -o jsonpath="{range .status.conditions[?(@.type == \"Active\")]}CQ - Active: {@.status} Reason: {@.reason} Message: {@.message}{'\n'}{end}"
kubectl get admissionchecks sample-dws-multikueue -o jsonpath="{range .status.conditions[?(@.type == \"Active\")]}AC - Active: {@.status} Reason: {@.reason} Message: {@.message}{'\n'}{end}"
kubectl get multikueuecluster multikueue-dws-worker-asia -o jsonpath="{range .status.conditions[?(@.type == \"Active\")]}MC-ASIA - Active: {@.status} Reason: {@.reason} Message: {@.message}{'\n'}{end}"
kubectl get multikueuecluster multikueue-dws-worker-us -o jsonpath="{range .status.conditions[?(@.type == \"Active\")]}MC-US - Active: {@.status} Reason: {@.reason} Message: {@.message}{'\n'}{end}"
kubectl get multikueuecluster multikueue-dws-worker-eu -o jsonpath="{range .status.conditions[?(@.type == \"Active\")]}MC-EU - Active: {@.status} Reason: {@.reason} Message: {@.message}{'\n'}{end}"
```

A successful output should look like this:

```
CQ - Active: True Reason: Ready Message: Can admit new workloads
AC - Active: True Reason: Active Message: The admission check is active
MC-ASIA - Active: True Reason: Active Message: Connected
MC-US - Active: True Reason: Active Message: Connected
MC-EU - Active: True Reason: Active Message: Connected
```

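The five validation commands differ only in resource kind, object name, and output label, so they can be wrapped in a small function. This is a minimal sketch rather than part of the repository: `check_active` and `check_all` are hypothetical helper names, while the resource and object names are the ones used in the commands above.

```shell
# Print the "Active" condition of one Kueue object on the manager cluster.
# Arguments: <resource kind> <object name> <label used in the output>
check_active() {
  local kind="$1" name="$2" label="$3"
  kubectl get "$kind" "$name" -o jsonpath="{range .status.conditions[?(@.type == \"Active\")]}${label} - Active: {@.status} Reason: {@.reason} Message: {@.message}{'\n'}{end}"
}

# Run all five checks from the README in one call.
check_all() {
  check_active clusterqueues dws-cluster-queue CQ
  check_active admissionchecks sample-dws-multikueue AC
  check_active multikueuecluster multikueue-dws-worker-asia MC-ASIA
  check_active multikueuecluster multikueue-dws-worker-us MC-US
  check_active multikueuecluster multikueue-dws-worker-eu MC-EU
}
```

Calling `check_all` against the manager context should print the same five lines as the individual commands.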
### Launch job

Submit your job to the Kueue controller, which will run it on a worker cluster with available resources:

```
kubectl create -f job-multi-dws-autopilot.yaml
```

### Get the status of the job

To check the job status and see where it's scheduled:

```
kubectl get workloads.kueue.x-k8s.io -o jsonpath='{range .items[*]}{.status.admissionChecks}{"\n"}{end}'
```

The message in the output shows where the job is scheduled.

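If the full admission check objects are hard to read, a narrower JSONPath can print only each workload's name and the admission check messages. This is a sketch under the assumption that the `message` field of each entry in `.status.admissionChecks` is what you want to see; `workload_messages` is a hypothetical helper name.

```shell
# Print one line per workload: its name, a tab, then the admission check
# messages for that workload.
workload_messages() {
  kubectl get workloads.kueue.x-k8s.io \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.admissionChecks[*].message}{"\n"}{end}'
}
```

Run `workload_messages` against the manager cluster context, since that is where the job was submitted.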
### Destroy resources

```
terraform -chdir=tf destroy -var project_id=<YOUR PROJECT ID>
```
@@ -0,0 +1,70 @@
#!/bin/bash

# Copyright 2024 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -o errexit
set -o nounset
set -o pipefail

echo 'Create GKE Autopilot clusters'

KUEUE_VERSION=v0.8.1
regions=("europe-west4" "asia-southeast1" "us-east4" "europe-west4")
kubeconfigs=("manager-europe-west4" "worker-asia-southeast1" "worker-us-east4" "worker-eu-west4")
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)")
PREFIX_MANAGER="man"
PREFIX_WORKER="w"
JOBSET_VERSION=v0.6.0

# Loop through the regions and start creating one cluster per entry
for i in "${!regions[@]}"; do
  region="${regions[$i]}"
  echo "$region"
  # Construct the cluster name: the first entry is the manager, the rest are workers
  if [[ $i -eq 0 ]]; then
    cluster_name="$PREFIX_MANAGER-$region"
  else
    cluster_name="$PREFIX_WORKER-$region"
  fi

  # Create the cluster without waiting for the operation to finish
  gcloud container clusters create-auto "$cluster_name" \
    --project "$PROJECT_ID" \
    --region "$region" \
    --release-channel "regular" \
    --async
done

# Wait for each create operation, fetch credentials, and rename the kubectl context
for i in "${!regions[@]}"; do
  region="${regions[$i]}"
  if [[ $i -eq 0 ]]; then
    cluster_name="$PREFIX_MANAGER-$region"
  else
    cluster_name="$PREFIX_WORKER-$region"
  fi

  opId=$(gcloud container operations list --filter "TARGET=https://container.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/$region/clusters/$cluster_name" --format="value(name)")
  gcloud container operations wait "$opId" --project "$PROJECT_ID" --region "$region"
  set +e
  until gcloud -q container clusters get-credentials "$cluster_name" \
    --project "$PROJECT_ID" \
    --region "$region"; do
    echo "GKE Cluster is provisioning. Retrying in 15 seconds..."
    sleep 15
  done
  set -e
  configname="${kubeconfigs[$i]}"
  kubectl config rename-context "gke_$PROJECT_ID"_"$region"_"$cluster_name" "$configname"
done
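The naming logic in the script can be read in isolation. The sketch below restates it as two small functions; `cluster_name_for` and `gke_context_for` are hypothetical names introduced here for illustration, with the `man` and `w` prefixes and the `gke_<project>_<region>_<cluster>` pattern taken from the script. It also shows why listing `europe-west4` twice does not cause a name clash: the manager and the worker in that region get different prefixes.

```shell
# cluster_name_for <index> <region>
# Index 0 is the manager cluster; every other index is a worker.
cluster_name_for() {
  local index="$1" region="$2"
  if [ "$index" -eq 0 ]; then
    printf 'man-%s\n' "$region"
  else
    printf 'w-%s\n' "$region"
  fi
}

# gke_context_for <project> <region> <cluster>
# Context name that the script expects get-credentials to have written,
# and that it then renames to the shorter entry from the kubeconfigs array.
gke_context_for() {
  printf 'gke_%s_%s_%s\n' "$1" "$2" "$3"
}
```

For example, `cluster_name_for 0 europe-west4` gives `man-europe-west4`, while `cluster_name_for 3 europe-west4` gives `w-europe-west4`.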
@@ -0,0 +1,232 @@
#!/bin/bash

# Copyright 2024 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -o errexit
set -o nounset
set -o pipefail

KUBECONFIG_OUT=${1:-kubeconfig}
MULTIKUEUE_SA=multikueue-sa
NAMESPACE=kueue-system

# Create a restricted MultiKueue role, service account and role binding
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${MULTIKUEUE_SA}
  namespace: ${NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ${MULTIKUEUE_SA}-role
rules:
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs/status
  verbs:
  - get
- apiGroups:
  - jobset.x-k8s.io
  resources:
  - jobsets
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - jobset.x-k8s.io
  resources:
  - jobsets/status
  verbs:
  - get
- apiGroups:
  - kueue.x-k8s.io
  resources:
  - workloads
  verbs:
  - create
  - delete
  - get
  - list
  - watch
  - update
- apiGroups:
  - kueue.x-k8s.io
  resources:
  - workloads/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - kubeflow.org
  resources:
  - tfjobs
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - kubeflow.org
  resources:
  - tfjobs/status
  verbs:
  - get
- apiGroups:
  - kubeflow.org
  resources:
  - paddlejobs
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - kubeflow.org
  resources:
  - paddlejobs/status
  verbs:
  - get
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs/status
  verbs:
  - get
- apiGroups:
  - kubeflow.org
  resources:
  - xgboostjobs
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - kubeflow.org
  resources:
  - xgboostjobs/status
  verbs:
  - get
- apiGroups:
  - kubeflow.org
  resources:
  - mpijobs
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - kubeflow.org
  resources:
  - mpijobs/status
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ${MULTIKUEUE_SA}-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ${MULTIKUEUE_SA}-role
subjects:
- kind: ServiceAccount
  name: ${MULTIKUEUE_SA}
  namespace: ${NAMESPACE}
EOF

# Get or create a secret bound to the new service account.
SA_SECRET_NAME=$(kubectl get -n ${NAMESPACE} sa/${MULTIKUEUE_SA} -o "jsonpath={.secrets[0]..name}")
if [ -z "$SA_SECRET_NAME" ]; then
  kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: ${MULTIKUEUE_SA}
  namespace: ${NAMESPACE}
  annotations:
    kubernetes.io/service-account.name: "${MULTIKUEUE_SA}"
EOF

  SA_SECRET_NAME=${MULTIKUEUE_SA}
fi

# Note: service account token is stored base64-encoded in the secret but must
# be plaintext in kubeconfig.
SA_TOKEN=$(kubectl get -n ${NAMESPACE} "secrets/${SA_SECRET_NAME}" -o "jsonpath={.data['token']}" | base64 -d)
CA_CERT=$(kubectl get -n ${NAMESPACE} "secrets/${SA_SECRET_NAME}" -o "jsonpath={.data['ca\.crt']}")

# Extract the cluster name and API server address from the current context
CURRENT_CONTEXT=$(kubectl config current-context)
CURRENT_CLUSTER=$(kubectl config view -o jsonpath="{.contexts[?(@.name == \"${CURRENT_CONTEXT}\")].context.cluster}")
CURRENT_CLUSTER_ADDR=$(kubectl config view -o jsonpath="{.clusters[?(@.name == \"${CURRENT_CLUSTER}\")].cluster.server}")

# Create the Kubeconfig file
echo "Writing kubeconfig in ${KUBECONFIG_OUT}"
cat >"${KUBECONFIG_OUT}" <<EOF
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ${CA_CERT}
    server: ${CURRENT_CLUSTER_ADDR}
  name: ${CURRENT_CLUSTER}
contexts:
- context:
    cluster: ${CURRENT_CLUSTER}
    user: ${CURRENT_CLUSTER}-${MULTIKUEUE_SA}
  name: ${CURRENT_CONTEXT}
current-context: ${CURRENT_CONTEXT}
kind: Config
preferences: {}
users:
- name: ${CURRENT_CLUSTER}-${MULTIKUEUE_SA}
  user:
    token: ${SA_TOKEN}
EOF
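Because the script interpolates `CA_CERT`, `CURRENT_CLUSTER_ADDR`, and `SA_TOKEN` straight into the file, an empty value (for example, a token that has not been populated yet) produces a kubeconfig that looks valid but cannot authenticate. The following sketch checks the generated file for those three fields; `kubeconfig_looks_complete` is a hypothetical helper that is not part of this commit, and it only tests that each key has a non-empty value, not that the credentials work.

```shell
# Return 0 when the kubeconfig at <file> has non-empty values for the three
# fields the script fills in; otherwise name the first bad field on stderr.
kubeconfig_looks_complete() {
  local file="$1" key
  for key in certificate-authority-data server token; do
    # Require "<key>: <non-empty value>" on some line of the file.
    grep -Eq "^[[:space:]]*${key}:[[:space:]]*[^[:space:]]+" "$file" || {
      echo "missing or empty: ${key}" >&2
      return 1
    }
  done
}

# Example (the default output name of the script above is "kubeconfig"):
#   kubeconfig_looks_complete kubeconfig && echo "kubeconfig fields are set"
```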

0 commit comments