-__External Update Extension__: Runtime Extension (Implementation) is a component responsible for performing in-place updates when the `External Update Lifecycle Hook` is invoked.
+__Update Extension__: Runtime Extension (Implementation) is a component responsible for performing in-place updates when the `External Update Lifecycle Hook` is invoked.

 ## Summary

@@ -147,21 +136,20 @@ The responsibility to determine which machine should be rolled out as well as th

 - Enable the implementation of pluggable update extensions.
 - Allow users to update Kubernetes clusters using pluggable External Update Extension.
-- Maintain a coherent user experience for both rolling and in-place updates.
 - Support External Update Extensions for both Control Plane (KCP or others) and MachineDeployment controlled machines.

 ### Non-Goals/Future work

 - To provide rollbacks in case of an in-place update failure. Failed updates need to be fixed manually by the user on the machine or by replacing the machine.
 - Introduce any changes to KCP (or any other control plane provider), MachineDeployment, MachineSet, Machine APIs.
-- Amend the desired state to something that the registered updaters can cover or register additional updaters capable of handling the desired changes.
+- Maintain a coherent user experience for both rolling and in-place updates.
 - Allow in-place updates for single-node clusters without the requirement to reprovision hosts (future goal).

 ## Proposal

-We propose a pluggable update strategy architecture that allows the registration of External Update Extensions to optionally handle the update process.
+We propose to extend upgrade workflows to call External Update Extensions, if defined.

-Initially, this feature will be implemented without making API changes in the current core Cluster API objects. It will follow Kubernetes' feature gate mechanism. This means that any changes in behavior are controlled by the feature gate `InPlaceUpdates`, which must be enabled by users for the new in-place updates workflow to be available. It is disabled unless explicitly configured.
+Initially, this feature will be implemented without making API changes in the current core Cluster API objects. It will follow Kubernetes' feature gate mechanism. All functionality related to In-Place Updates will be available only if the `InPlaceUpdates` feature flag is set to true. It is disabled unless explicitly configured.

 This proposal introduces a Lifecycle Hook named `ExternalUpdate` for communication between CAPI and external update implementers. Multiple external updaters can be registered, each of them only covering a subset of machine changes. The CAPI controllers will ask the external updaters what kind of changes they can handle and, based on the response, compose and orchestrate them to achieve the desired state.
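
The composition step described in the paragraph above can be illustrated with a minimal sketch. It is not the CAPI implementation: the `Updater` type, the `plan_update` function, and the change kinds `"memory"` and `"kubernetes-version"` are hypothetical names chosen for illustration, and the real hook request/response shape is not defined here. The sketch assumes each updater simply declares which kinds of change it accepts, and that the first registered updater to accept a change is the one that handles it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Updater:
    """Hypothetical stand-in for a registered external updater."""
    name: str
    can_handle: frozenset  # kinds of change this updater says it accepts


def plan_update(required_changes, updaters):
    """Split the required changes across updaters.

    Returns (assignments, uncovered): assignments maps an updater name to
    the set of changes it will apply; uncovered holds the changes that no
    updater accepted. A non-empty uncovered set means the caller would
    fall back to machine replacement.
    """
    remaining = set(required_changes)
    assignments = {}
    for updater in updaters:
        accepted = remaining & updater.can_handle
        if accepted:
            assignments[updater.name] = accepted
            remaining -= accepted
    return assignments, remaining


# Usage with the two fictional extensions named in this proposal.
updaters = [
    Updater("vsphere-vm-memory-update", frozenset({"memory"})),
    Updater("kcp-version-upgrade", frozenset({"kubernetes-version"})),
]
assignments, uncovered = plan_update({"memory", "kubernetes-version"}, updaters)
strategy = "in-place" if not uncovered else "replace"
```

If any required change is left in `uncovered`, the sketch selects `"replace"`, mirroring the proposal's default fallback to machine replacement.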
@@ -404,14 +392,12 @@ We might explore the ability to represent this "dirty" state at the API level. W

 Remediation can be used as the solution to recover a machine when an in-place update fails on it. The remediation process stays the same as today: the MachineHealthCheck controller monitors machine health status and marks it to be remediated based on pre-configured rules, then ControlPlane/MachineDeployment replaces the machine or calls external remediation.

-However, in-place updates might cause Nodes to become unhealthy while the update is in progress. In addition, an in-place update might take more (or less) time than a fresh machine creation. Hence, in order to successfully use MHC to remediate in-place updated Machines, we require:
+However, in-place updates might cause Nodes to become unhealthy while the update is in progress. In addition, an in-place update might take more (or less) time than a fresh machine creation. Hence, in order to successfully use MHC to remediate in-place updated Machines, in a future iteration of this proposal we will consider:
 * A mechanism to identify if a Machine is being updated. We will surface this in the Machine status. API details will be added later.
 * A way to define different rules for Machines undergoing an update. This might involve new fields in the MHC object. We will decouple these API changes from this proposal. For the first implementation of in-place updates, we might decide to just disable remediation for Machines that are undergoing an update.
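
As a rough illustration of the simplest option mentioned above (disabling remediation while an update is in progress), the check could look like the sketch below. The `Machine` type and its `updating_in_place` flag are hypothetical: the proposal states that the actual API for surfacing this state in the Machine status will be defined later.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical, simplified view of a Machine for this sketch."""
    name: str
    healthy: bool
    updating_in_place: bool  # assumed status flag, not a real CAPI field


def needs_remediation(machine):
    """Return True only for unhealthy machines that are not mid-update."""
    if machine.updating_in_place:
        # Nodes may look unhealthy during an in-place update; skip them.
        return False
    return not machine.healthy


machines = [
    Machine("kcp-1-a", healthy=False, updating_in_place=True),
    Machine("kcp-1-b", healthy=False, updating_in_place=False),
    Machine("kcp-1-c", healthy=True, updating_in_place=False),
]
to_remediate = [m.name for m in machines if needs_remediation(m)]
```

A later iteration could replace the early return with update-specific rules (for example, a longer timeout) once the corresponding MHC fields exist.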

 ### Examples

-*All functionality related to In-Place Updates will be available only if the `InPlaceUpdates` feature flag is set to true.*
-
 This section aims to showcase our vision for the In-Place Updates end state. It shows a high-level picture of a few common use cases, especially around how the different components interact through the API.

 Note that these examples don't show all the low-level details. Some of those details might not yet be defined in this doc and will be added later; the examples here are just to help communicate the vision.
@@ -420,6 +406,7 @@ Let's imagine a vSphere cluster with a KCP control plane that has two fictional
 1. `vsphere-vm-memory-update`: The extension uses vSphere APIs to hot-add memory to VMs if "Memory Hot Add" is enabled or through a power cycle.
 2. `kcp-version-upgrade`: Updates the Kubernetes version of KCP machines by using an agent that first updates the Kubernetes-related packages (`kubeadm`, `kubectl`, etc.) and then runs the `kubeadm upgrade` command. The In-place Update extension communicates with this agent, sending instructions with the Kubernetes version a machine needs to be updated to.

+> Please note that the exact spec of the messages will be defined during implementation; the current examples are only meant to explain the flow.

 #### KCP kubernetes version update

@@ -535,10 +522,8 @@ spec:
     configRef:
       apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
       kind: KubeadmConfig
--      name: kcp-1-hfg374h-9wc29
--      uid: fc69d363-272a-4b91-aa35-72ccdaa7a427
-+      name: kcp-1-hfg374h-flkf3
-+      uid: ddab8525-bb36-4a86-81e9-ef3eeeb33e18
+      name: kcp-1-hfg374h-9wc29
+      uid: fc69d363-272a-4b91-aa35-72ccdaa7a427
 status:
   conditions:
 +  - lastTransitionTime: "2024-12-31T23:50:00Z"
@@ -630,8 +615,8 @@ spec:
     configRef:
       apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
       kind: KubeadmConfig
-      name: kcp-1-hfg374h-flkf3
-      uid: ddab8525-bb36-4a86-81e9-ef3eeeb33e18
+      name: kcp-1-hfg374h-9wc29
+      uid: fc69d363-272a-4b91-aa35-72ccdaa7a427
 status:
   conditions:
   - lastTransitionTime: "2024-12-31T23:50:00Z"
@@ -652,8 +637,8 @@ spec:
     configRef:
       apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
       kind: KubeadmConfig
-      name: kcp-1-hfg374h-flkf3
-      uid: ddab8525-bb36-4a86-81e9-ef3eeeb33e18
+      name: kcp-1-hfg374h-9wc29
+      uid: fc69d363-272a-4b91-aa35-72ccdaa7a427
 status:
   conditions:
 -  - lastTransitionTime: "2024-12-31T23:50:00Z"
@@ -676,6 +661,7 @@ sequenceDiagram
 participant capi as MD controller
 participant msc as MachineSet Controller
 participant mach as Machine Controller
+participant hook as KCP version <br>update extension
 participant hook as vSphere memory <br>update extension
 end

@@ -768,9 +754,6 @@ The Machine controller then creates a new MachineSet with the new spec and moves
@@ -894,10 +877,6 @@ Both the `kcp-version-upgrade` and the `vsphere-vm-memory-update` extensions inf

 Since the fallback to machine replacement is a default strategy and always enabled, the MachineDeployment controller proceeds with the rollout process as it does today, replacing the old machines with new ones.

-### API Changes
-
-*All functionality related to In-Place Updates will be available only if the `InPlaceUpdates` feature flag is set to true.*
-
 ### Security Model

 On the core CAPI side, the security model for this feature is very straightforward: CAPI controllers only need to read/create/update CAPI resources, and those controllers are the only ones that need to modify the CAPI resources. Moreover, the controllers that need to perform these actions already have the necessary permissions over the resources they need to modify.
@@ -906,7 +885,14 @@ However, each external updater should define their own security model. Depending

 ### Risks and Mitigations

-1. One of the risks for this process could be that during a single-node cluster in-place update, the extension implementation might decline the update, and that would result in falling back to the rolling update strategy by default, which could possibly lead to breaking a cluster. For the first iteration, users must ensure that the changes they make will be accepted by their updater.
+The main risk of this change is its complexity. This risk is mitigated by:
+
+1. Implementing the feature in incremental steps.
+
+2. Avoiding user-facing changes in the first iteration, allowing us to gather feedback and validate the core functionality before making changes that are difficult to revert.
+
+3. Using a feature flag to control the availability of this functionality, ensuring it remains opt-in and can be disabled if issues arise.