Commit 7d5bce9

farshadghodsian authored and sajmera-pensando committed
Added Slinky example to docs
1 parent b521a1d commit 7d5bce9

File tree

9 files changed: +1571 −0 lines changed


docs/slinky/slinky-example.md

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
# Example Training Workload via Slinky

The following outlines the steps to get up and running with Slinky on Kubernetes and to run a simple image classification training workload to verify that GPUs are accessible.

## Clone this repo and go into the slinky folder

```bash
git clone https://github.com/rocm/gpu-operator.git
cd gpu-operator/example/slinky
```

## Installing Slinky Prerequisites

The following steps for installing the prerequisites and installing Slinky are taken from the SlinkyProject/slurm-operator repo's [quick-start guide](https://github.com/SlinkyProject/slurm-operator/blob/main/docs/quickstart.md).

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true
```

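Before moving on, you can optionally confirm that both charts came up cleanly (a quick sanity check; the namespaces match the installs above):

```bash
kubectl --namespace=cert-manager get pods
kubectl --namespace=prometheus get pods
```
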
## Installing Slinky Operator

```bash
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --values=values-operator.yaml --version=0.1.0 --namespace=slinky --create-namespace
```

Make sure the operator deployed successfully with:

```sh
kubectl --namespace=slinky get pods
```

Output should be similar to:

```sh
NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s
```

## Building the Slurm Compute Node Image

You will need to build a Slurm Docker image for the Slurm compute node that includes ROCm and a ROCm-compatible PyTorch version. The slurm-rocm-torch directory contains an example Dockerfile that can be used to build this image. It is based on the [Dockerfile from the Slinky repo](https://github.com/SlinkyProject/containers/blob/main/schedmd/slurm/24.05/ubuntu24.04/Dockerfile), with the only modifications being:

- the base image is `rocm/pytorch-training:v25.4`, which already has ROCm and PyTorch installed
- the `COPY patches/ patches/` line has been commented out, as there are currently no patches to apply
- the `COPY --from=build /tmp/*.deb /tmp/` line has also been commented out, as there are no .deb files to copy

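Building and pushing the image looks roughly like the following sketch, assuming you are still in the `example/slinky` folder and the example Dockerfile lives in `slurm-rocm-torch/`. The registry, repository, and tag are placeholders that should match what you will set in `values-slurm.yaml` below:

```bash
# Build the compute node image from the example Dockerfile
docker build -t docker-registry/docker-repository/docker-image:image-tag ./slurm-rocm-torch

# Push it to your registry so the Kubernetes nodes can pull it
docker push docker-registry/docker-repository/docker-image:image-tag
```
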
## Installing Slurm Cluster

Once the image has been built and pushed to a repository, update the `values-slurm.yaml` file to specify the compute node image you will be using:

```yaml
# Slurm compute (slurmd) configurations.
compute:
  #
  # -- (string)
  # Set the image pull policy.
  imagePullPolicy: IfNotPresent
  #
  # Default image for the nodeset pod (slurmd)
  # Each nodeset may override this setting.
  image:
    #
    # -- (string)
    # Set the image repository to use.
    repository: docker-registry/docker-repository/docker-image
    #
    # -- (string)
    # Set the image tag to use.
    # @default -- The Release appVersion.
    tag: image-tag
```

Install the Slurm cluster Helm chart:

```bash
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.1.0 --namespace=slurm --create-namespace
```

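If you prefer not to edit `values-slurm.yaml`, the same image settings can be overridden on the command line instead. This is a sketch using standard Helm `--set` flags; the repository and tag values are the same placeholders as above:

```bash
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.1.0 --namespace=slurm --create-namespace \
  --set compute.image.repository=docker-registry/docker-repository/docker-image \
  --set compute.image.tag=image-tag
```
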
Make sure the Slurm cluster deployed successfully with:

```sh
kubectl --namespace=slurm get pods
```

Output should be similar to:

```sh
NAME                              READY   STATUS    RESTARTS   AGE
slurm-accounting-0                1/1     Running   0          5m00s
slurm-compute-gpu-node            1/1     Running   0          5m00s
slurm-controller-0                2/2     Running   0          5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
slurm-mariadb-0                   1/1     Running   0          5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s
```

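Once all pods are `Running`, you can also confirm that the compute node has registered with the Slurm controller (the same `sinfo` command is covered again in the useful commands at the end of this guide):

```bash
kubectl exec -it slurm-controller-0 -n slurm -- sinfo
```
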
## Prepping the Compute Node

1. Get the Slurm compute node pod name

```bash
SLURM_COMPUTE_POD=$(kubectl get pods -n slurm | grep ^slurm-compute-gpu-node | awk '{print $1}'); echo $SLURM_COMPUTE_POD
```

2. Add the slurm user to the video and render groups and create the slurm user's home directory on the Slurm compute node

```bash
kubectl exec -it -n slurm $SLURM_COMPUTE_POD -- bash -c "
usermod -aG video,render slurm
mkdir -p /home/slurm
chown slurm:slurm /home/slurm"
```

3. Copy the PyTorch test script, found in the `example/slinky` folder of this repo, to the Slurm compute node

```bash
kubectl cp test.py slurm/$SLURM_COMPUTE_POD:/tmp/test.py
```

4. Copy the Fashion-MNIST image classification model training script to the Slurm compute node

```bash
kubectl cp train_fashion_mnist.py slurm/$SLURM_COMPUTE_POD:/tmp/train_fashion_mnist.py
```

5. Run the `test.py` script on the compute node to confirm GPUs are accessible (a quicker device-level check is also sketched after these steps)

```bash
kubectl exec -it slurm-controller-0 -n slurm -- srun python3 /tmp/test.py
```

6. Run the single-GPU training script on the compute node

```bash
kubectl exec -it slurm-controller-0 -n slurm -- srun python3 /tmp/train_fashion_mnist.py
```

7. Run the multi-GPU training script on the compute node

```bash
kubectl exec -it slurm-controller-0 -n slurm -- srun apptainer exec --rocm --bind /tmp:/tmp torch_rocm.sif torchrun --standalone --nnodes=1 --nproc_per_node=8 --master-addr localhost train_mnist_distributed.py
```

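As a quicker device-level sanity check than the full `test.py` run in step 5, you can look at the GPUs directly inside the compute pod. This is a sketch and assumes `rocm-smi` is on the `PATH` in the compute image (it ships with ROCm) and that the `SLURM_COMPUTE_POD` variable from step 1 is still set:

```bash
# List the GPUs visible inside the Slurm compute pod
kubectl exec -it -n slurm $SLURM_COMPUTE_POD -- rocm-smi

# One-line PyTorch check that the ROCm devices are usable from Python
kubectl exec -it -n slurm $SLURM_COMPUTE_POD -- python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```
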
## Other Useful Slurm Commands

### Check Slurm Node Info

```bash
kubectl exec -it slurm-controller-0 -n slurm -- sinfo
```

### Check Job Queue

```bash
kubectl exec -it slurm-controller-0 -n slurm -- squeue
```

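`squeue` is most useful once jobs are submitted asynchronously rather than run interactively with `srun`. As a sketch, the same training script could be queued as a batch job using the standard `sbatch --wrap` option, assuming the script was copied to `/tmp` as in the steps above:

```bash
# Submit the training script as a batch job; sbatch prints the job ID
kubectl exec -it slurm-controller-0 -n slurm -- sbatch --wrap="python3 /tmp/train_fashion_mnist.py"

# Watch it move through the queue
kubectl exec -it slurm-controller-0 -n slurm -- squeue
```
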
### Check Node Resources

The `%G` field in the output lists each node's generic resources (GRES), which is where the GPUs advertised by the compute node appear.

```bash
kubectl exec -it slurm-controller-0 -n slurm -- sinfo -N -o "%N %G"
```

docs/sphinx/_toc.yml

Lines changed: 3 additions & 0 deletions
@@ -61,6 +61,9 @@ subtrees:
   entries:
   - file: specialized_networks/airgapped-install
   - file: specialized_networks/http-proxy
+- caption: Slurm on Kubernetes
+  entries:
+  - file: slinky/slinky-example
 - caption: Contributing
   entries:
   - file: contributing/developer-guide

docs/sphinx/_toc.yml.in

Lines changed: 3 additions & 0 deletions
@@ -61,6 +61,9 @@ subtrees:
   entries:
   - file: specialized_networks/airgapped-install
   - file: specialized_networks/http-proxy
+- caption: Slurm on Kubernetes
+  entries:
+  - file: slinky/slinky-example
 - caption: Contributing
   entries:
   - file: contributing/developer-guide
