Commit 739cc70

yansun1996 authored and sajmera-pensando committed
[DOC] Misc fix on helm installation description
1 parent 7d5bce9 commit 739cc70

4 files changed, +256 -31 lines changed

docs/installation/kubernetes-helm.md

Lines changed: 7 additions & 5 deletions
@@ -118,19 +118,21 @@ To install the latest version of the GPU Operator run the following Helm install
 ```bash
 helm install amd-gpu-operator rocm/gpu-operator-charts \
   --namespace kube-amd-gpu \
-  --create-namespace
+  --create-namespace \
   --version=v1.3.0
 ```

 ```{note}
 Installation Options
 - Skip NFD installation: `--set node-feature-discovery.enabled=false`
-- Skip KMM installation: `--set kmm.enabled=false`
+- Skip KMM installation: `--set kmm.enabled=false`. <br> Although KMM is a [Kubernetes-SIGs](https://github.com/kubernetes-sigs)-maintained project, it is strongly recommended to use the AMD-optimized and published KMM images included in each operator release.
 - Disable default DeviceConfig installation: `--set crds.defaultCR.install=false`
 ```

-```{warning}
-It is strongly recommended to use AMD-optimized KMM images included in the operator release.
+```{tip}
+1. Before v1.3.0, the GPU Operator Helm chart did not provide a default `DeviceConfig`; you need to take an extra step to create one.
+
+2. Starting from v1.3.0, the `helm install` command supports one-step installation and configuration by creating a default `DeviceConfig` with default values. These defaults may not suit every deployment scenario; refer to {ref}`typical-deployment-scenarios` for more information and the corresponding `helm install` commands.
 ```

 ### 3. Helm Chart Customization Parameters
### 3. Helm Chart Customization Parameters
@@ -471,7 +473,7 @@ kubectl get modules -n kube-amd-gpu
 - Check NFD status:

 ```bash
-kubectl get nodefeatures -n kube-amd-gpu
+kubectl get nodefeaturerules -n kube-amd-gpu
 ```

 For more detailed troubleshooting steps, see our [Troubleshooting Guide](../troubleshooting).
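
The installation options listed in the note of this file's first hunk can be combined in a single `helm install` call. The following is a minimal sketch, not part of the commit: it only prints the command rather than running it, and the `SKIP_NFD`, `SKIP_KMM`, and `SKIP_DEFAULT_CR` variables and the `build_install_cmd` function are hypothetical names introduced for illustration. The release name, chart, namespace, version, and `--set` flags are copied from the diff.

```bash
# Sketch only: compose (and print, not execute) the helm install command.
# SKIP_NFD, SKIP_KMM and SKIP_DEFAULT_CR are hypothetical toggles, not chart options.
build_install_cmd() {
  cmd="helm install amd-gpu-operator rocm/gpu-operator-charts"
  cmd="$cmd --namespace kube-amd-gpu --create-namespace --version=v1.3.0"

  # Append each optional flag from the "Installation Options" note when requested.
  if [ "${SKIP_NFD:-0}" = "1" ]; then
    cmd="$cmd --set node-feature-discovery.enabled=false"
  fi
  if [ "${SKIP_KMM:-0}" = "1" ]; then
    cmd="$cmd --set kmm.enabled=false"
  fi
  if [ "${SKIP_DEFAULT_CR:-0}" = "1" ]; then
    cmd="$cmd --set crds.defaultCR.install=false"
  fi

  printf '%s\n' "$cmd"
}

# Print the default command; review it before running it by hand.
build_install_cmd
```

Printing the command first makes it easy to check the flags before applying anything to a cluster, which matters here because the default `DeviceConfig` may not fit every deployment scenario.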

docs/slinky/slinky-example.md

Lines changed: 8 additions & 0 deletions
@@ -11,6 +11,14 @@ cd example/slinky

 ## Installing Slinky Prerequisites

+Install the AMD GPU Operator, configure the `DeviceConfig`, and make sure that the device plugin is advertising the AMD GPU devices as allocatable resources:
+
+```bash
+$ kubectl get node -oyaml | grep -i allocatable -A 10 | grep amd.com
+
+amd.com/gpu: "8"
+```
+
 The following steps for installing prerequisites and installing Slinky have been taken from the SlinkyProject/slurm-operator repo [quick-start guide](https://github.com/SlinkyProject/slurm-operator/blob/main/docs/quickstart.md)

 ```bash
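
The allocatable-resource check added in this hunk can also be turned into a number for use in a script. The sketch below is not part of the commit: `gpu_count` is a hypothetical helper that sums the quoted `amd.com/gpu` values it reads on stdin, and it assumes the input has already been filtered to the allocatable entries, as the `grep` pipeline in the diff does. The sample line is the one shown in the diff.

```bash
# Sketch only: sum the quoted amd.com/gpu values read from stdin.
# Expects lines such as:  amd.com/gpu: "8"
gpu_count() {
  # Split on double quotes so that field 2 is the number; print 0 when nothing matches.
  awk -F'"' '/amd\.com\/gpu:/ { total += $2 } END { print total + 0 }'
}

# Sample line copied from the diff. On a live cluster, feed the filtered output instead:
#   kubectl get node -oyaml | grep -i allocatable -A 10 | grep amd.com | gpu_count
printf 'amd.com/gpu: "8"\n' | gpu_count
```

With several GPU nodes the helper reports the total across nodes, and a result of `0` suggests the device plugin is not yet advertising any GPUs.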

docs/troubleshooting.md

Lines changed: 18 additions & 1 deletion
@@ -18,6 +18,21 @@ To collect logs from the AMD GPU Operator:
 kubectl logs -n kube-amd-gpu <pod-name>
 ```

+
+## Potential Issues with default ``DeviceConfig``
+
+* Refer to [Typical Deployment Scenarios](../usage.html#typical-deployment-scenarios) for more information and for the `helm install` commands and configs that fit your specific use case.
+
+* If operand pods (e.g. device plugin, metrics exporter) are stuck in the ``Init:0/1`` state, your GPU worker node either has no GPU driver loaded or the driver was not loaded properly.
+
+* If you use an inbox or pre-installed driver, check the node's ``dmesg`` output to see why the driver was not loaded properly.
+
+* If you want to deploy an out-of-tree driver, we suggest checking the [Driver Installation Guide](./drivers/installation.html), then modifying the default ``DeviceConfig`` so that the Operator installs the out-of-tree GPU driver on your worker nodes.
+
+```bash
+kubectl edit deviceconfigs -n kube-amd-gpu default
+```
+
 ## Debugging Driver Installation

 If the AMD GPU driver build fails:
@@ -42,7 +57,7 @@ kubectl get events -n kube-amd-gpu

 ## Using Techsupport-dump Tool

-The techsupport-dump tool can be used to collect system state and logs for debugging:
+The [techsupport-dump script](https://github.com/ROCm/gpu-operator/blob/main/tools/techsupport_dump.sh) can be used to collect system state and logs for debugging:

 ```bash
 ./tools/techsupport_dump.sh [-w] [-o yaml/json] [-k kubeconfig] <node-name/all>
@@ -53,3 +68,5 @@ Options:
 - `-w`: wide option
 - `-o yaml/json`: output format (default: json)
 - `-k kubeconfig`: path to kubeconfig (default: ~/.kube/config)
+
+Please file an issue with the collected techsupport bundle on our [GitHub Issues](https://github.com/ROCm/gpu-operator/issues) page.
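
The ``Init:0/1`` symptom described in the first hunk of this file can be spotted quickly from the pod listing. The sketch below is not part of the commit: `stuck_init_pods` is a hypothetical helper, and the pod names in the sample input are made up for illustration. It assumes the default `kubectl get pods` table layout, in which STATUS is the third column.

```bash
# Sketch only: print the names of pods whose STATUS column is exactly Init:0/1.
# Reads `kubectl get pods` table output on stdin and skips the header row.
stuck_init_pods() {
  awk 'NR > 1 && $3 == "Init:0/1" { print $1 }'
}

# Hypothetical sample input; the pod names are invented for this example.
# On a live cluster: kubectl get pods -n kube-amd-gpu | stuck_init_pods
stuck_init_pods <<'EOF'
NAME                    READY   STATUS     RESTARTS   AGE
example-device-plugin   0/1     Init:0/1   0          5m
example-controller      1/1     Running    0          5m
EOF
```

Any pod name printed here is a candidate for the driver checks listed in the diff: inspect `dmesg` on the node, or edit the default `DeviceConfig`.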
