Commit 6cf0a63

yiannistri authored and g-gaston committed
Update proposal based on feedback
1 parent ad3312b commit 6cf0a63

1 file changed

+26
-61
lines changed

docs/proposals/20240807-in-place-updates.md

+26-61
@@ -81,8 +81,6 @@ __External Update Lifecycle Hook__: CAPI Lifecycle Runtime Hook to invoke extern
8181

8282
__External Update Extension__: Runtime Extension (Implementation) is a component responsible to perform in place updates when the `External Update Lifecycle Hook` is invoked.
8383

84-
__Marking Machine as Pending/Done__: Using the `sigs.k8s.io/cluster-api/internal/hooks.MarkAsPending()` and `sigs.k8s.io/cluster-api/internal/hooks.MarkAsDone()` functions to track that updaters should be called and to mark machine as done updating.
85-
8684
## Summary
8785

8886
The proposal introduces update extensions allowing users to execute custom strategies when performing Cluster API rollouts.
@@ -108,22 +106,20 @@ Over time several improvement were made to Cluster API immutable rollouts:
108106

109107
Even if the project continues to improve immutable rollouts, most probably there are and there will always be some remaining use cases where it is complex for users to perform immutable rollouts, or where users perceive immutable rollouts to be too disruptive to how they are used to manage machines in their organization:
110108
* More efficient updates (multiple instances) that don't require re-bootstrap. Re-bootstrapping a bare metal machine takes ~10-15 mins on average. Speed matters when you have 100s - 1000s of nodes to upgrade. For a common telco RAN use case, users can have 30000-ish nodes. Depending on the parallelism, that could take days / weeks to upgrade because of the re-bootstrap time.
111-
* Single node cluster without extra hardware available.
112-
* `TODO: looking for more real life usecases here`
109+
* Credentials rotation, e.g. rotating authorized keys for SSH.
113110

114-
With this proposal, Cluster API provides a new extensibility point for users willing to implement their own specific solution for these problems, allowing them to implement a custom rollout strategy to be triggered via a new external update extension point implemented using the existing runtime extension framework.
115111

116-
With the implementation of custom rollout strategy, users can take ownership of the rollout process and embrace in-place rollout strategies, intentionally trading off some of the benefits that you get from immutable infrastructure.
112+
With this proposal, Cluster API provides a new extensibility point for users willing to implement their own specific solution for these problems by implementing an Update extension.
117113

118-
### Divide and conquer
114+
With the implementation of an Update extension, users can take ownership of the rollout process and embrace in-place rollout strategies, intentionally trading off some of the benefits that you get from immutable infrastructure.
119115

120-
As this proposal is an output of the In-place updates Feature Group, ensuring that the external update extension allows the implementation of in-place rollout strategies is considered a non-negotiable goal of this effort.
116+
### Divide and conquer
121117

122-
Please note that the practical consequence of focusing on in-place rollout strategies, is that the possibility to implement different types of custom rollout strategies, even if technically possible, won’t be validated in this first iteration (future goal).
118+
Considering the complexity of this topic, a phased approach is required to design and implement the solution for in-place upgrades.
123119

124-
Another important point to surface, before digging into implementation details of the proposal, is the fact that this proposal is not tackling the problem of improving CAPI to embrace all the possibilities that external update extensions are introducing. E.g. If an external update extension introduces support for in-place updates, using “BootstrapConfig” (emphasis on bootstrap) as the place where most of the machine configurations are defined seems not ideal.
120+
The main goal of the first iteration of this proposal is to make it possible for Cluster API users to start experimenting with in-place upgrades, so we can gather feedback and evolve to the next stage.
125121

126-
However, at the same time we would like to make it possible for Cluster API users to start exploring this field, gain experience, and report back so we can have concrete use cases and real-world feedback to evolve our API.
122+
This iteration will focus on implementing the machinery required to interact with update extensions, while user-facing changes in the API types are deferred to follow-up iterations.
127123

128124
### Tenets
129125

@@ -136,36 +132,36 @@ Cluster API user experience MUST be the same when using default, immutable updat
136132
If external update extensions can not cover the totality of the desired changes, CAPI WILL defer to Cluster API’s default, immutable rollouts. This is important for a couple of reasons:
137133

138134
* It allows to implement custom rollout strategies incrementally, without the need to cover all use cases up-front.
139-
* There are case when replacing the machine will always be necessary:
135+
* There are cases when replacing the machine will always be necessary:
140136
* When it is not possible to recover the machine, e.g. hardware failure.
141137
* When the user determines that recovering the machine is too complex/costly vs replacing it.
142138
* Automatic machine remediation (unless you use external remediation strategies)
143139

144140
#### Clean separation of concern
145141

146-
The external update extension will be responsible to perform the updates on a single machine.
142+
It is the responsibility of the extension to decide if it can perform changes in-place and to perform these changes on a single machine. If the extension decides that it cannot perform changes in-place, CAPI will fall back to rollout.
147143

148-
The responsibility to determine which machine should be rolled out as well as the responsibility to handle rollout options like MaxSurge/MaxUnavailable will remain on the controllers owning the machine (e.g. KCP, MD controller).
144+
The responsibility to determine which machine should be rolled out as well as the responsibility to handle rollout options like MaxSurge/MaxUnavailable will remain on the controllers owning the machine (e.g. KCP, MD controller).
149145

150146
### Goals
151147

152-
- Enable the implementation of in-place update strategies.
148+
- Enable the implementation of pluggable update extensions.
153149
- Allow users to update Kubernetes clusters using pluggable External Update Extension.
154150
- Maintain a coherent user experience for both rolling and in-place updates.
155151
- Support External Update Extensions for both Control Plane (KCP or others) and MachineDeployment controlled machines.
156-
- Allow in-place updates for single-node clusters without the requirement to reprovision hosts.
157152

158153
### Non-Goals/Future work
159154

160155
- To provide rollbacks in case of an in-place update failure. Failed updates need to be fixed manually by the user on the machine or by replacing the machine.
161156
- Introduce any changes to KCP (or any other control plane provider), MachineDeployment, MachineSet, Machine APIs.
162157
- Amend the desired state to something that the registered updaters can cover or register additional updaters capable of handling the desired changes.
158+
- Allow in-place updates for single-node clusters without the requirement to reprovision hosts.
163159

164160
## Proposal
165161

166-
We propose a pluggable update strategy architecture that allows External Update Extension to handle the update process.
162+
We propose to extend upgrade workflows to call External Update Extensions, if defined.
167163

168-
Initially, this feature will be implemented without making API changes in the current core Cluster API objects. It will follow Kubernetes' feature gate mechanism and be contained within the experimental package. This means that any changes in behavior are controlled by the feature gate `InPlaceUpdates`, which must be enabled by users for the new in-place updates workflow to be available. It is disabled unless explicitly configured.
164+
Initially, this feature will be implemented without making API changes in the current core Cluster API objects. It will follow Kubernetes' feature gate mechanism. This means that any changes in behavior are controlled by the feature gate `InPlaceUpdates`, which must be enabled by users for the new in-place updates workflow to be available. It is disabled unless explicitly configured.
169165

170166
This proposal introduces a Lifecycle Hook named `ExternalUpdate` for communication between CAPI and external update implementers. Multiple external updaters can be registered, each of them only covering a subset of machine changes. The CAPI controllers will ask the external updaters what kind of changes they can handle and, based on the response, compose and orchestrate them to achieve the desired state.
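
The coverage decision described above can be sketched in Go. This is only an illustration under stated assumptions: the `Updater` type, the `planInPlace` function, and the field paths (`spec.version`, `spec.memoryMiB`, `spec.diskGiB`) are hypothetical and are not defined by the proposal. The sketch assigns each required change to the first updater that reports support for it, and signals a fallback to the default rollout as soon as one change is not covered.

```go
package main

import "fmt"

// Updater is a hypothetical, simplified stand-in for a registered external
// updater: it only reports which changed field paths it can handle in place.
type Updater struct {
	Name      string
	Supported map[string]bool
}

// planInPlace assigns every required change to the first updater that
// supports it. It returns the per-updater assignment and true only when all
// changes are covered; otherwise the caller should fall back to a rollout.
func planInPlace(changes []string, updaters []Updater) (map[string][]string, bool) {
	plan := map[string][]string{}
	for _, change := range changes {
		covered := false
		for _, u := range updaters {
			if u.Supported[change] {
				plan[u.Name] = append(plan[u.Name], change)
				covered = true
				break
			}
		}
		if !covered {
			// At least one change cannot be applied in place.
			return nil, false
		}
	}
	return plan, true
}

func main() {
	// Updater names are taken from the examples in the proposal; the field
	// paths they support are made up for this sketch.
	updaters := []Updater{
		{Name: "kcp-version-upgrade", Supported: map[string]bool{"spec.version": true}},
		{Name: "vsphere-vm-memory-update", Supported: map[string]bool{"spec.memoryMiB": true}},
	}

	plan, ok := planInPlace([]string{"spec.version", "spec.memoryMiB"}, updaters)
	fmt.Println("in-place:", ok, plan)

	_, ok = planInPlace([]string{"spec.version", "spec.diskGiB"}, updaters)
	fmt.Println("in-place:", ok)
}
```

The all-or-nothing return value reflects the tenet that CAPI defers to immutable rollouts whenever the extensions cannot cover the totality of the desired changes.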
171167

@@ -175,26 +171,23 @@ With the introduction of this experimental feature, users may want to apply the
175171

176172
#### Story 1
177173

178-
As an cluster operator, I want to perform in-place updates on my Kubernetes clusters without replacing the underlying machines. I expect the update process to be flexible, allowing me to customize the strategy based on my specific requirements, such as air-gapped environments or special node configurations.
174+
As a cluster operator, I want to perform in-place updates on my Kubernetes clusters without replacing the underlying machines. I expect the update process to be flexible, allowing me to customize the strategy based on my specific requirements, such as air-gapped environments or special node configurations.
179175

180176
#### Story 2
181177

182178
As a cluster operator, I want to seamlessly transition between rolling and in-place updates while maintaining a consistent user interface. I appreciate the option to choose or implement my own update strategy, ensuring that the update process aligns with my organization's unique needs.
183179

184180
#### Story 3
185-
As an cluster operator for resource constrained environments, I want to utilize CAPI pluggable external update mechanism to implement in-place updates without requiring additional compute capacity in a single node cluster.
181+
As a cluster operator for resource constrained environments, I want to utilize CAPI pluggable external update mechanism to implement in-place updates without requiring additional compute capacity in a single node cluster.
186182

187183
#### Story 4
188-
As an cluster operator for highly specialized/customized environments, I want to utilize CAPI pluggable external update mechanism to implement in-place updates without losing the existing VM/OS customizations.
184+
As a cluster operator for highly specialized/customized environments, I want to utilize CAPI pluggable external update mechanism to implement in-place updates without losing the existing VM/OS customizations.
189185

190186
#### Story 5
191187
As a cluster operator, I want to update machine attributes supported by my infrastructure provider without the need to recreate the machine.
192188

193189
#### Story 6
194-
As a cluster service provider, I want guidance/documentation on how to write external update extension for own my use case.
195-
196-
#### Story 7
197-
As a bootstrap/controlplane provider developer, I want guidance/documentation on how to reuse some parts of this pluggable external update mechanism.
190+
As a cluster service provider, I want guidance/documentation on how to write external update extension for my own use case.
198191

199192
### High level flow
200193

@@ -231,7 +224,7 @@ sequenceDiagram
231224

232225
When configured, external updates will, roughly, follow these steps:
233226
1. CP/MD Controller: detect an update is required.
234-
2. CP/MD Controller: query defined update extensions, and based on the response decides if an update should happen in-place.
227+
2. CP/MD Controller: query defined update extensions, and based on the response decides if an update should happen in-place. If not, the update will be performed as of today (rollout).
235228
3. CP/MD Controller: mark machines as pending using `sigs.k8s.io/cluster-api/internal/hooks.MarkAsPending()` function to track that updaters should be called.
236229
4. Machine Controller: set `UpToDate` condition on machines to `False`.
237230
5. Machine Controller: invoke all registered updaters, sequentially, one by one.
@@ -312,14 +305,12 @@ end
312305

313306
The MachineDeployment controller updates machines in place in a very similar way to rolling updates: by creating a new MachineSet and moving the machines from the old MS to the new one. We want to stress that the Machine objects won't be deleted and recreated like in the current rolling strategy. The MachineDeployment will just update the OwnerRefs and labels, effectively moving the existing Machine object from one MS to another. The number of machines moved at once might be made configurable on the MachineDeployment in the same way `maxSurge` and `maxUnavailable` control this for rolling updates.
314307

315-
When the new MachineSet controller sees a new Machine with an outdated spec, it updates the spec to match the one in the MS. This update together with marking machine as pending and setting a condition is what triggers the Machine controller to start executing
316-
the external updaters.
317-
318308
### KCP updates
319309

320310
```mermaid
321311
sequenceDiagram
322312
box Management Cluster
313+
participant Operator
323314
participant apiserver as kube-api server
324315
participant capi as KCP controller
325316
participant mach as Machine Controller
@@ -391,14 +382,16 @@ Once a Machine is marked as pending and `UpToDate` condition is set and the Mach
391382

392383
The Machine controller currently calls registered external updaters sequentially but without a defined order. We are explicitly not trying to design a solution for ordering of execution at this stage. However, determining a specific ordering mechanism or dependency management between update extensions will need to be addressed in future iterations of this proposal.
393384

394-
The controller will trigger updaters by hitting another RuntimeHook endpoint (eg. `/UpdateMachine`). The updater could respond saying "update completed", "update failed" or "update in progress" with an optional "retry after X seconds". The CAPI controller will continuously poll the status of the update by hitting the same endpoint until it reaches a terminal state.
385+
The controller will trigger updaters by hitting a RuntimeHook endpoint (eg. `/UpdateMachine`). The updater could respond saying "update completed", "update failed" or "update in progress" with an optional "retry after X seconds". The CAPI controller will continuously poll the status of the update by hitting the same endpoint until it reaches a terminal state.
395386

396-
CAPI expects the `/UpdateMachine` endpoint of an updater to be idempotent: for the same Machine with the same spec, the endpoint can be called any number of times (before and after it completes), and the end result should be the same. CAPI guarantees that once an `/UpdateMachine` endpoint has been called once, it won't change the Machine spec until the update reaches a terminal state.
387+
CAPI expects the `/UpdateMachine` endpoint of an updater to be idempotent: for the same Machine with the same spec, the endpoint can be called any number of times (before and after it completes), and the end result should be the same. CAPI guarantees that once an `/UpdateMachine` endpoint has been called once, it won't change the Machine spec until the update either completes or fails.
397388

398389
Once all of the updaters are complete, the Machine controller will mark machine as done. If the update fails, this will be reflected in the Machine status.
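
A minimal Go sketch of the invocation loop described above, assuming hypothetical names throughout: `Status`, `Response`, `UpdateFunc`, `runUpdaters` and `noSleep` are not part of the proposal or of any CAPI package. Each `UpdateFunc` stands in for a call to an updater's `/UpdateMachine` endpoint. Updaters are called one by one, and each is polled until it reports a terminal state.

```go
package main

import (
	"fmt"
	"time"
)

// Status mirrors the three outcomes described in the text; the constant
// names are illustrative only.
type Status string

const (
	Completed  Status = "Completed"
	Failed     Status = "Failed"
	InProgress Status = "InProgress"
)

// Response is a hypothetical shape for an updater's reply.
type Response struct {
	Status     Status
	RetryAfter time.Duration
}

// UpdateFunc stands in for a call to an updater's /UpdateMachine endpoint.
// It is expected to be idempotent for the same Machine spec.
type UpdateFunc func(machine string) (Response, error)

// noSleep can be passed as the sleep function when no real waiting is wanted.
func noSleep(time.Duration) {}

// runUpdaters invokes the updaters sequentially, polling each one until it
// reaches a terminal state. A nil error means every updater completed.
func runUpdaters(machine string, updaters []UpdateFunc, sleep func(time.Duration)) error {
	for i, update := range updaters {
		done := false
		for !done {
			resp, err := update(machine)
			if err != nil {
				return fmt.Errorf("updater %d: %w", i, err)
			}
			switch resp.Status {
			case Completed:
				done = true
			case InProgress:
				sleep(resp.RetryAfter) // poll the same endpoint again later
			case Failed:
				return fmt.Errorf("updater %d reported a failed update", i)
			default:
				return fmt.Errorf("updater %d returned unknown status %q", i, resp.Status)
			}
		}
	}
	return nil
}

func main() {
	calls := 0
	slow := func(string) (Response, error) {
		calls++
		if calls < 3 {
			return Response{Status: InProgress, RetryAfter: 10 * time.Millisecond}, nil
		}
		return Response{Status: Completed}, nil
	}
	err := runUpdaters("machine-1", []UpdateFunc{slow}, time.Sleep)
	fmt.Println("error:", err, "calls:", calls)
}
```

A real controller would more likely requeue the reconcile request with the suggested delay than block in a loop; the blocking loop is used here only to keep the sketch short.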
399390

400391
From this point on, the `KCP` or `MachineDeployment` controller will take over and set the `UpToDate` condition to `True`.
401392

393+
Note: We might revisit which controller should set `UpToDate` during implementation, because we have to make sure there are no race conditions that can lead to reconcile failures, but apart from the ownership of this operation, the workflows described in this doc should not be impacted.
394+
402395
### Infra Machine Template changes
403396

404397
As mentioned before, the user experience to update in-place should be the exact same one as for rolling updates. This includes the need to rotate the Infra machine template. For providers that bundle the kubernetes components in some kind of image, this means that when upgrading kubernetes versions, a new image will be required.
@@ -586,7 +579,7 @@ When the `kcp-version-upgrade` extension receives the request, it verifies it ca
586579
{
587580
"error": null,
588581
"status": "InProgress",
589-
"tryAgain": "5m0s"
582+
"retryAfterSeconds": "5m0s"
590583
}
591584
```
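
As a hedged illustration of how a caller might consume a response like the one above, the following Go sketch decodes the payload and derives the polling delay. The `updateResponse` type and the `retryDelay` function are assumptions made for this example; only the JSON field names come from the payload shown. Because the example value `"5m0s"` is written as a Go duration string, the sketch parses it with `time.ParseDuration`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// updateResponse mirrors the fields of the example payload. The Go type
// itself is an assumption; the proposal does not define it.
type updateResponse struct {
	Error             *string `json:"error"`
	Status            string  `json:"status"`
	RetryAfterSeconds string  `json:"retryAfterSeconds"`
}

// retryDelay decodes a response body and returns its status together with
// the delay to wait before polling again (zero if none was provided).
func retryDelay(body []byte) (string, time.Duration, error) {
	var r updateResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", 0, err
	}
	if r.RetryAfterSeconds == "" {
		return r.Status, 0, nil
	}
	d, err := time.ParseDuration(r.RetryAfterSeconds)
	return r.Status, d, err
}

func main() {
	body := []byte(`{"error": null, "status": "InProgress", "retryAfterSeconds": "5m0s"}`)
	status, delay, err := retryDelay(body)
	fmt.Println(status, delay, err)
}
```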
592585

@@ -672,7 +665,7 @@ status:
672665
type: UpToDate
673666
```
674667

675-
This process is repeated a third time with the last KCP machine, finally marking the KCP object as up to date.
668+
This process is repeated for the second and third KCP machine, finally marking the KCP object as up to date.
676669

677670
#### Update worker node memory
678671

@@ -903,34 +896,6 @@ Both the `kcp-version-upgrade` and the `vsphere-vm-memory-update` extensions inf
903896

904897
Since the fallback to machine replacement is a default strategy and always enabled, the MachineDeployment controller proceeds with the rollout process as it does today, replacing the old machines with new ones.
905898

906-
### API Changes
907-
908-
*All functionality related to In-Place Updates will be available only if the `InPlaceUpdates` feature flag is set to true.*
909-
910-
#### External Update RuntimeExtension
911-
912-
> TODO: we will add this later, after we get feedback from the first daft
913-
914-
##### `CanUpdateMachine` endpoint
915-
##### Request
916-
> Requirements:
917-
> * Desired Machine/Bootstrap/InfraMachine changes
918-
919-
##### Response
920-
> Requirements:
921-
> * Set of supported changes, probably an array of strings (the path in the object)
922-
> * Error
923-
924-
##### `UpdateMachine` endpoint
925-
##### Request
926-
> Requirements:
927-
> * Machine reference - namespace, name
928-
929-
##### Response
930-
> Requirements:
931-
> * Result: [Success/Error/InProgress]
932-
> * Retry in X seconds
933-
934899
### Security Model
935900

936901
On the core CAPI side, the security model for this feature is very straightforward: CAPI controllers only require to read/create/update CAPI resources and those controllers are the only ones that need to modify the CAPI resources. Moreover, the controllers that need to perform these actions already have the necessary permissions over the resources they need to modify.
