Commit 0ce35f4

sriram-30 authored and sajmera-pensando committed
Retry failed nodes for upgrade
1 parent 723d6b4 commit 0ce35f4

3 files changed: +268 -26 lines changed

docs/drivers/upgrading.md

Lines changed: 7 additions & 1 deletion
@@ -133,7 +133,13 @@ The following are considered during the automatic upgrade process
 
 1. Selection of a node should satisfy both `maxUnavailableNodes` and `maxParallelUpgrades` criteria
 2. All nodes in failed state is considered while calculating `maxUnavailableNodes`
-3. When a driver upgrade on a node fails, the node will be in cordoned state. User has to fix the issue and uncordon the node manually. Such nodes will be automatically picked up for automatic driver upgrade operation.
+
+### 3. Recovery From Upgrade Failure
+
+If it is observed that the upgrade status is in failed state for a specific node, the user can debug the node, fix it and then add this label to the node to restart upgrade on it. The upgrade state will be reset and it can be tracked as it was before
+
+- Command: `kubectl label node <nodename> operator.amd.com/gpu-driver-upgrade-state=upgrade-required`
+- Label: `operator.amd.com/gpu-driver-upgrade-state: upgrade-required`
 
 ## 2. Manual Upgrade Process
 
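
The recovery step documented in this hunk could be wrapped in a small shell helper. The sketch below is not part of the commit: the function names `retry_upgrade` and `show_upgrade_state`, the `KUBECTL` override variable, and the use of `--overwrite` are assumptions added for illustration. Only the label key and value come from the documented command.

```shell
#!/bin/sh
# Hypothetical helper, not part of the commit. Only the label key and value
# below are taken from the documented command.

# KUBECTL can be overridden (e.g. KUBECTL="echo kubectl") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"
UPGRADE_STATE_LABEL="operator.amd.com/gpu-driver-upgrade-state"

# retry_upgrade NODE
# Re-label a node whose upgrade failed so that the upgrade is restarted on it.
retry_upgrade() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: retry_upgrade <nodename>" >&2
        return 1
    fi
    # Assumption: --overwrite is added because kubectl refuses to change the
    # value of a label key that is already set on the node.
    $KUBECTL label node "$node" "${UPGRADE_STATE_LABEL}=upgrade-required" --overwrite
}

# show_upgrade_state NODE
# Print the node with the upgrade-state label shown as an extra column.
show_upgrade_state() {
    $KUBECTL get node "$1" -L "$UPGRADE_STATE_LABEL"
}
```

After the node has been debugged and fixed, calling `retry_upgrade <nodename>` and then `show_upgrade_state <nodename>` would apply the label and display its current value, so the upgrade can be tracked as before. Setting `KUBECTL="echo kubectl"` first prints the commands instead of running them.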

internal/controllers/mock_upgrademgr.go

Lines changed: 74 additions & 4 deletions
Some generated files are not rendered by default.