Commit 1e47525

Address review comments
1 parent f96435f commit 1e47525

1 file changed: +56 -70 lines changed

docs/proposals/20240807-in-place-updates.md

@@ -22,52 +22,41 @@ status: experimental
2222
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
2323
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
2424

25-
- [In-place updates in Cluster API](#in-place-updates-in-cluster-api)
26-
- [Table of Contents](#table-of-contents)
27-
- [Glossary](#glossary)
28-
- [Summary](#summary)
29-
- [Motivation](#motivation)
30-
- [Divide and conquer](#divide-and-conquer)
31-
- [Tenets](#tenets)
32-
- [Same UX](#same-ux)
33-
- [Fallback to Immutable rollouts](#fallback-to-immutable-rollouts)
34-
- [Clean separation of concern](#clean-separation-of-concern)
35-
- [Goals](#goals)
36-
- [Non-Goals/Future work](#non-goalsfuture-work)
37-
- [Proposal](#proposal)
38-
- [User Stories](#user-stories)
39-
- [Story 1](#story-1)
40-
- [Story 2](#story-2)
41-
- [Story 3](#story-3)
42-
- [Story 4](#story-4)
43-
- [Story 5](#story-5)
44-
- [Story 6](#story-6)
45-
- [Story 7](#story-7)
46-
- [High level flow](#high-level-flow)
47-
- [Deciding the update strategy](#deciding-the-update-strategy)
48-
- [MachineDeployment updates](#machinedeployment-updates)
49-
- [KCP updates](#kcp-updates)
50-
- [Machine updates](#machine-updates)
51-
- [Infra Machine Template changes](#infra-machine-template-changes)
52-
- [Remediation](#remediation)
53-
- [Examples](#examples)
54-
- [KCP kubernetes version update](#kcp-kubernetes-version-update)
55-
- [Update worker node memory](#update-worker-node-memory)
56-
- [Update worker nodes OS from Linux to Windows](#update-worker-nodes-os-from-linux-to-windows)
57-
- [API Changes](#api-changes)
58-
- [External Update RuntimeExtension](#external-update-runtimeextension)
59-
- [`CanUpdateMachine` endpoint](#canupdatemachine-endpoint)
60-
- [Request](#request)
61-
- [Response](#response)
62-
- [`UpdateMachine` endpoint](#updatemachine-endpoint)
63-
- [Request](#request-1)
64-
- [Response](#response-1)
65-
- [Security Model](#security-model)
66-
- [Risks and Mitigations](#risks-and-mitigations)
67-
- [Additional Details](#additional-details)
68-
- [Test Plan](#test-plan)
69-
- [Graduation Criteria](#graduation-criteria)
70-
- [Implementation History](#implementation-history)
25+
- [Glossary](#glossary)
26+
- [Summary](#summary)
27+
- [Motivation](#motivation)
28+
- [Divide and conquer](#divide-and-conquer)
29+
- [Tenets](#tenets)
30+
- [Same UX](#same-ux)
31+
- [Fallback to Immutable rollouts](#fallback-to-immutable-rollouts)
32+
- [Clean separation of concern](#clean-separation-of-concern)
33+
- [Goals](#goals)
34+
- [Non-Goals/Future work](#non-goalsfuture-work)
35+
- [Proposal](#proposal)
36+
- [User Stories](#user-stories)
37+
- [Story 1](#story-1)
38+
- [Story 2](#story-2)
39+
- [Story 3](#story-3)
40+
- [Story 4](#story-4)
41+
- [Story 5](#story-5)
42+
- [Story 6](#story-6)
43+
- [High level flow](#high-level-flow)
44+
- [Deciding the update strategy](#deciding-the-update-strategy)
45+
- [MachineDeployment updates](#machinedeployment-updates)
46+
- [KCP updates](#kcp-updates)
47+
- [Machine updates](#machine-updates)
48+
- [Infra Machine Template changes](#infra-machine-template-changes)
49+
- [Remediation](#remediation)
50+
- [Examples](#examples)
51+
- [KCP kubernetes version update](#kcp-kubernetes-version-update)
52+
- [Update worker node memory](#update-worker-node-memory)
53+
- [Update worker nodes OS from Linux to Windows](#update-worker-nodes-os-from-linux-to-windows)
54+
- [Security Model](#security-model)
55+
- [Risks and Mitigations](#risks-and-mitigations)
56+
- [Additional Details](#additional-details)
57+
- [Test Plan](#test-plan)
58+
- [Graduation Criteria](#graduation-criteria)
59+
- [Implementation History](#implementation-history)
7160

7261
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
7362

@@ -79,7 +68,7 @@ __In-place Update__: any change to a Machine spec, including the Kubernetes Vers
7968

8069
__External Update Lifecycle Hook__: CAPI Lifecycle Runtime Hook to invoke external update extensions.
8170

82-
__External Update Extension__: Runtime Extension (Implementation) is a component responsible to perform in place updates when the `External Update Lifecycle Hook` is invoked.
71+
__Update Extension__: Runtime Extension (Implementation) is a component responsible to perform in place updates when the `External Update Lifecycle Hook` is invoked.
8372

8473
## Summary
8574

@@ -147,21 +136,20 @@ The responsibility to determine which machine should be rolled out as well as th
147136

148137
- Enable the implementation of pluggable update extensions.
149138
- Allow users to update Kubernetes clusters using pluggable External Update Extension.
150-
- Maintain a coherent user experience for both rolling and in-place updates.
151139
- Support External Update Extensions for both Control Plane (KCP or others) and MachineDeployment controlled machines.
152140

153141
### Non-Goals/Future work
154142

155143
- To provide rollbacks in case of an in-place update failure. Failed updates need to be fixed manually by the user on the machine or by replacing the machine.
156144
- Introduce any changes to KCP (or any other control plane provider), MachineDeployment, MachineSet, Machine APIs.
157-
- Ammend the desired state to something that the registered updaters can cover or register additional updaters capable of handling the desired changes.
145+
- Maintain a coherent user experience for both rolling and in-place updates.
158146
- Allow in-place updates for single-node clusters without the requirement to reprovision hosts (future goal).
159147

160148
## Proposal
161149

162-
We propose a pluggable update strategy architecture that allows the registration of External Update Extensions to optionally handle the update process.
150+
We propose to extend upgrade workflows to call External Update Extensions, if defined.
163151

164-
Initially, this feature will be implemented without making API changes in the current core Cluster API objects. It will follow Kubernetes' feature gate mechanism. This means that any changes in behavior are controlled by the feature gate `InPlaceUpdates`, which must be enabled by users for the new in-place updates workflow to be available. It is disabled unless explicitly configured.
152+
Initially, this feature will be implemented without making API changes in the current core Cluster API objects. It will follow Kubernetes' feature gate mechanism. All functionality related to In-Place Updates will be available only if the `InPlaceUpdates` feature flag is set to true. It is disabled unless explicitly configured.
165153

166154
This proposal introduces a Lifecycle Hook named `ExternalUpdate` for communication between CAPI and external update implementers. Multiple external updaters can be registered, each of them only covering a subset of machine changes. The CAPI controllers will ask the external updaters what kind of changes they can handle and, based on the response, compose and orchestrate them to achieve the desired state.
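
The composition step described above can be illustrated with a minimal Go sketch. It is not the proposal's API: the `Updater` type, the `Plan` function, and the field names `version` and `memoryMiB` are hypothetical, and the real request and response messages are left to the implementation. The sketch only assumes that each registered updater reports which changed fields it can handle, and that any change left uncovered means the in-place path cannot be used.

```go
package main

import "fmt"

// Updater is a hypothetical stand-in for a registered external update extension.
type Updater struct {
	Name string
	// Handles is the set of Machine spec fields this updater reports it can
	// change in place. The field names used here are illustrative only.
	Handles map[string]bool
}

// Plan assigns every changed field to the first updater that accepts it.
// It returns ok=false when at least one change is not covered by any
// updater, which in the proposal means falling back to machine replacement.
func Plan(changes []string, updaters []Updater) (plan map[string][]string, ok bool) {
	plan = map[string][]string{}
	for _, field := range changes {
		covered := false
		for _, u := range updaters {
			if u.Handles[field] {
				plan[u.Name] = append(plan[u.Name], field)
				covered = true
				break
			}
		}
		if !covered {
			return nil, false
		}
	}
	return plan, true
}

func main() {
	updaters := []Updater{
		{Name: "vsphere-vm-memory-update", Handles: map[string]bool{"memoryMiB": true}},
		{Name: "kcp-version-upgrade", Handles: map[string]bool{"version": true}},
	}

	// Both changes are covered, so the in-place path can be used.
	if plan, ok := Plan([]string{"version", "memoryMiB"}, updaters); ok {
		for _, u := range updaters {
			fmt.Println(u.Name, "handles", plan[u.Name])
		}
	}

	// No updater covers an OS change, so the rollout falls back to replacement.
	if _, ok := Plan([]string{"osImage"}, updaters); !ok {
		fmt.Println("fallback: replace machines")
	}
}
```

A first-match assignment keeps the sketch short; a real controller may need a different policy when several updaters claim the same change.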
167155

@@ -404,14 +392,12 @@ We might explore the ability to represent this "dirty" state at the API level. W
404392

405393
Remediation can be used to recover a machine when an in-place update fails on it. The remediation process stays the same as today: the MachineHealthCheck controller monitors machine health status and marks the machine for remediation based on pre-configured rules; the ControlPlane/MachineDeployment then replaces the machine or calls external remediation.
406394

407-
However, in-place updates might cause Nodes to become unhealthy while the update is in progress. In addition, an in-place update might take more (or less) time than a fresh machine creation. Hence, in order to successfully use MHC to remediate in-place updated Machines, we require:
395+
However, in-place updates might cause Nodes to become unhealthy while the update is in progress. In addition, an in-place update might take more (or less) time than a fresh machine creation. Hence, in order to successfully use MHC to remediate in-place updated Machines, in a future iteration of this proposal we will consider:
408396
* A mechanism to identify if a Machine is being updated. We will surface this in the Machine status. API details will be added later.
409397
* A way to define different rules for Machines undergoing an update. This might involve new fields in the MHC object. We will decouple these API changes from this proposal. For the first implementation of in-place updates, we might decide to just disable remediation for Machines that are undergoing an update.
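
As a rough illustration of the "disable remediation while updating" option, the Go sketch below skips remediation for a Machine whose annotations list a pending `ExternalUpdate` hook. This is an assumption made for the example: the proposal says the "being updated" signal will be surfaced in the Machine status with API details to be added later, so reading the `runtime.cluster.x-k8s.io/pending-hooks` annotation is only a placeholder, and `isUpdating` and `shouldRemediate` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// pendingHooksAnnotation holds a comma-separated list of pending hooks.
// Using it as the "update in progress" signal is an assumption of this sketch.
const pendingHooksAnnotation = "runtime.cluster.x-k8s.io/pending-hooks"

// isUpdating reports whether the ExternalUpdate hook is still pending.
func isUpdating(annotations map[string]string) bool {
	for _, hook := range strings.Split(annotations[pendingHooksAnnotation], ",") {
		if strings.TrimSpace(hook) == "ExternalUpdate" {
			return true
		}
	}
	return false
}

// shouldRemediate applies the simplest possible rule: an unhealthy Machine
// is remediated only when no in-place update is in progress.
func shouldRemediate(unhealthy bool, annotations map[string]string) bool {
	return unhealthy && !isUpdating(annotations)
}

func main() {
	updating := map[string]string{pendingHooksAnnotation: "ExternalUpdate"}
	fmt.Println("remediate while updating:", shouldRemediate(true, updating))
	fmt.Println("remediate when idle:", shouldRemediate(true, nil))
}
```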
410398

411399
### Examples
412400

413-
*All functionality related to In-Place Updates will be available only if the `InPlaceUpdates` feature flag is set to true.*
414-
415401
This section aims to showcase our vision for the In-place Updates end state. It shows a high-level picture of a few common use cases, especially how the different components interact through the API.
416402

417403
Note that these examples don't show all the low-level details. Some of those details might not yet be defined in this doc and will be added later; the examples here are only meant to help communicate the vision.
@@ -420,6 +406,7 @@ Let's imagine a vSphere cluster with a KCP control plane that has two fictional
420406
1. `vsphere-vm-memory-update`: The extension uses vSphere APIs to hot-add memory to VMs if "Memory Hot Add" is enabled or through a power cycle.
421407
2. `kcp-version-upgrade`: Updates the kubernetes version of KCP machines by using an agent that first updates the kubernetes related packages (`kubeadm`, `kubectl`, etc.) and then runs the `kubeadm upgrade` command. The In-place Update extension communicates with this agent, sending instructions with the kubernetes version a machine needs to be updated to.
422408

409+
> Please note that exact Spec of messages will be defined during implementation; current examples are only meant to explain the flow.
423410
424411
#### KCP kubernetes version update
425412

@@ -535,10 +522,8 @@ spec:
535522
configRef:
536523
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
537524
kind: KubeadmConfig
538-
- name: kcp-1-hfg374h-9wc29
539-
- uid: fc69d363-272a-4b91-aa35-72ccdaa7a427
540-
+ name: kcp-1-hfg374h-flkf3
541-
+ uid: ddab8525-bb36-4a86-81e9-ef3eeeb33e18
525+
name: kcp-1-hfg374h-9wc29
526+
uid: fc69d363-272a-4b91-aa35-72ccdaa7a427
542527
status:
543528
conditions:
544529
+ - lastTransitionTime: "2024-12-31T23:50:00Z"
@@ -630,8 +615,8 @@ spec:
630615
configRef:
631616
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
632617
kind: KubeadmConfig
633-
name: kcp-1-hfg374h-flkf3
634-
uid: ddab8525-bb36-4a86-81e9-ef3eeeb33e18
618+
name: kcp-1-hfg374h-9wc29
619+
uid: fc69d363-272a-4b91-aa35-72ccdaa7a427
635620
status:
636621
conditions:
637622
- lastTransitionTime: "2024-12-31T23:50:00Z"
@@ -652,8 +637,8 @@ spec:
652637
configRef:
653638
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
654639
kind: KubeadmConfig
655-
name: kcp-1-hfg374h-flkf3
656-
uid: ddab8525-bb36-4a86-81e9-ef3eeeb33e18
640+
name: kcp-1-hfg374h-9wc29
641+
uid: fc69d363-272a-4b91-aa35-72ccdaa7a427
657642
status:
658643
conditions:
659644
- - lastTransitionTime: "2024-12-31T23:50:00Z"
@@ -676,6 +661,7 @@ sequenceDiagram
676661
participant capi as MD controller
677662
participant msc as MachineSet Controller
678663
participant mach as Machine Controller
664+
participant hook as KCP version <br>update extension
679665
participant hook as vSphere memory <br>update extension
680666
end
681667
@@ -768,9 +754,6 @@ The Machine controller then creates a new MachineSet with the new spec and moves
768754
apiVersion: cluster.x-k8s.io/v1beta1
769755
kind: Machine
770756
metadata:
771-
+ labels:
772-
+ cluster.x-k8s.io/cluster-name: cluster1
773-
+ cluster.x-k8s.io/deployment-name: md-1
774757
+ annotations:
775758
+ runtime.cluster.x-k8s.io/pending-hooks: ExternalUpdate
776759
name: md-1-6bp6g
@@ -894,10 +877,6 @@ Both the `kcp-version-upgrade` and the `vsphere-vm-memory-update` extensions inf
894877

895878
Since the fallback to machine replacement is a default strategy and always enabled, the MachineDeployment controller proceeds with the rollout process as it does today, replacing the old machines with new ones.
896879

897-
### API Changes
898-
899-
*All functionality related to In-Place Updates will be available only if the `InPlaceUpdates` feature flag is set to true.*
900-
901880
### Security Model
902881

903882
On the core CAPI side, the security model for this feature is very straightforward: CAPI controllers only require to read/create/update CAPI resources and those controllers are the only ones that need to modify the CAPI resources. Moreover, the controllers that need to perform these actions already have the necessary permissions over the resources they need to modify.
@@ -906,7 +885,14 @@ However, each external updater should define their own security model. Depending
906885

907886
### Risks and Mitigations
908887

909-
1. One of the risks for this process could be that during a single node cluster in-place update, extension implementation might decline the update and that would result in falling back to rolling update strategy by default, which could possibly lead to breaking a cluster. For the first iteration, users must ensure that the changes they make will be accepted by their updater.
888+
The main risk of this change is its complexity. This risk is mitigated by:
889+
890+
1. Implementing the feature in incremental steps.
891+
892+
2. Avoiding user-facing changes in the first iteration, allowing us to gather feedback and validate the core functionality before making changes that are difficult to revert.
893+
894+
3. Using a feature flag to control the availability of this functionality, ensuring it remains opt-in and can be disabled if issues arise.
895+
910896

911897
## Additional Details
912898
