✨ Support RKE2ControlPlane mhc remediation #627

Merged
merged 4 commits into from
May 8, 2025
4 changes: 2 additions & 2 deletions Makefile
@@ -327,7 +327,7 @@ docker-build: buildx-machine docker-pull-prerequisites


.PHONY: docker-build-rke2-bootstrap
docker-build-rke2-bootstrap:
docker-build-rke2-bootstrap:
DOCKER_BUILDKIT=1 BUILDX_BUILDER=$(MACHINE) docker buildx build \
--platform $(ARCH) \
--load \
@@ -395,7 +395,7 @@ kubectl: # Download kubectl cli into tools bin folder
##@ e2e:

# Allow overriding the e2e configurations
GINKGO_FOCUS ?= Workload cluster creation
GINKGO_FOCUS ?=
GINKGO_SKIP ?= API Version Upgrade
GINKGO_NODES ?= 1
GINKGO_NOCOLOR ?= false
4 changes: 4 additions & 0 deletions controlplane/api/v1alpha1/conversion.go
@@ -57,6 +57,10 @@ func (src *RKE2ControlPlane) ConvertTo(dstRaw conversion.Hub) error {
dst.Spec.AgentConfig.PodSecurityAdmissionConfigFile = restored.Spec.AgentConfig.PodSecurityAdmissionConfigFile
}

if restored.Spec.RemediationStrategy != nil {
dst.Spec.RemediationStrategy = restored.Spec.RemediationStrategy
}

dst.Spec.ServerConfig.EmbeddedRegistry = restored.Spec.ServerConfig.EmbeddedRegistry
dst.Spec.MachineTemplate = restored.Spec.MachineTemplate
dst.Status = restored.Status
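
The hunk above copies `RemediationStrategy` from a `restored` object back onto the hub type during `ConvertTo`. A minimal, self-contained sketch of that restore pattern follows, assuming the older API version cannot represent the field and that the full hub spec is stashed in an annotation on the way down. The annotation key, the type names, and the helper functions here are all hypothetical and much simpler than the real conversion code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// conversionDataKey is a hypothetical annotation key used only in this sketch.
const conversionDataKey = "example.io/conversion-data"

// remediationStrategy and hubSpec stand in for the newer (hub) API version.
type remediationStrategy struct {
	MaxRetry *int32 `json:"maxRetry,omitempty"`
}

type hubSpec struct {
	Replicas            int32                `json:"replicas"`
	RemediationStrategy *remediationStrategy `json:"remediationStrategy,omitempty"`
}

// spokeObject stands in for the older API version, which has no remediation field.
type spokeObject struct {
	Annotations map[string]string
	Replicas    int32
}

// convertFromHub downgrades a hub spec and stashes the full spec in an annotation.
func convertFromHub(src hubSpec) (spokeObject, error) {
	raw, err := json.Marshal(src)
	if err != nil {
		return spokeObject{}, err
	}
	return spokeObject{
		Annotations: map[string]string{conversionDataKey: string(raw)},
		Replicas:    src.Replicas,
	}, nil
}

// convertToHub upgrades a spoke object and restores fields the spoke cannot represent.
func convertToHub(src spokeObject) (hubSpec, error) {
	dst := hubSpec{Replicas: src.Replicas}
	raw, ok := src.Annotations[conversionDataKey]
	if !ok {
		return dst, nil // nothing was stashed, so there is nothing to restore
	}
	var restored hubSpec
	if err := json.Unmarshal([]byte(raw), &restored); err != nil {
		return hubSpec{}, err
	}
	// Same guard as in the hunk: only overwrite when a value was actually stored.
	if restored.RemediationStrategy != nil {
		dst.RemediationStrategy = restored.RemediationStrategy
	}
	return dst, nil
}

// roundTripMaxRetry sends a hub spec through the spoke and back and reports the surviving maxRetry.
func roundTripMaxRetry(maxRetry int32) (int32, bool) {
	spoke, err := convertFromHub(hubSpec{
		Replicas:            3,
		RemediationStrategy: &remediationStrategy{MaxRetry: &maxRetry},
	})
	if err != nil {
		return 0, false
	}
	hub, err := convertToHub(spoke)
	if err != nil || hub.RemediationStrategy == nil || hub.RemediationStrategy.MaxRetry == nil {
		return 0, false
	}
	return *hub.RemediationStrategy.MaxRetry, true
}

func main() {
	got, ok := roundTripMaxRetry(5)
	fmt.Println(got, ok)
}
```

Without the restore step, a round trip through the older version would silently drop the strategy, which is the failure this kind of guard is meant to prevent.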
2 changes: 2 additions & 0 deletions controlplane/api/v1alpha1/zz_generated.conversion.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

91 changes: 91 additions & 0 deletions controlplane/api/v1beta1/rke2controlplane_types.go
@@ -17,6 +17,8 @@ limitations under the License.
package v1beta1

import (
"time"

corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/util/intstr"
@@ -40,6 +42,23 @@ const (
// LegacyRKE2ControlPlane is a controlplane annotation that marks the CP as legacy. This CP will not provide
// etcd certificate management or etcd membership management.
LegacyRKE2ControlPlane = "controlplane.cluster.x-k8s.io/legacy"

// RemediationInProgressAnnotation is used to keep track that an RCP remediation is in progress, and more
// specifically it tracks that the system is in between having deleted an unhealthy machine and recreating its replacement.
// NOTE: if something external to CAPI removes this annotation, the system cannot detect the above situation; this can lead to
// failures in updating remediation retry or remediation count (both counters restart from zero).
RemediationInProgressAnnotation = "controlplane.cluster.x-k8s.io/remediation-in-progress"

// RemediationForAnnotation is used to link a new machine to the unhealthy machine it is replacing;
// please note that in case of retry, when the remediating machine also fails, the system keeps track of
// the first machine of the sequence only.
// NOTE: if something external to CAPI removes this annotation, this can lead to
// failures in updating remediation retry (the counter restarts from zero).
RemediationForAnnotation = "controlplane.cluster.x-k8s.io/remediation-for"

// DefaultMinHealthyPeriod defines the default minimum period before we consider a remediation on a
// machine unrelated to the previous remediation.
DefaultMinHealthyPeriod = 1 * time.Hour
)

// RKE2ControlPlaneSpec defines the desired state of RKE2ControlPlane.
@@ -98,6 +117,10 @@ type RKE2ControlPlaneSpec struct {

// The RolloutStrategy to use to replace control plane machines with new ones.
RolloutStrategy *RolloutStrategy `json:"rolloutStrategy"`

// remediationStrategy is the RemediationStrategy that controls how control plane machine remediation happens.
// +optional
RemediationStrategy *RemediationStrategy `json:"remediationStrategy,omitempty"`
}

// RKE2ControlPlaneMachineTemplate defines the template for Machines
@@ -265,6 +288,10 @@ type RKE2ControlPlaneStatus struct {
// AvailableServerIPs is a list of the Control Plane IP addresses that can be used to register further nodes.
// +optional
AvailableServerIPs []string `json:"availableServerIPs,omitempty"`

// lastRemediation stores info about last remediation performed.
// +optional
LastRemediation *LastRemediationStatus `json:"lastRemediation,omitempty"`
}

// +kubebuilder:object:root=true
@@ -423,6 +450,70 @@ const (
SnapshotValidationWebhook DisabledPluginComponent = "rke2-snapshot-validation-webhook"
)

// RemediationStrategy defines how control plane machine remediation happens.
type RemediationStrategy struct {
// maxRetry is the maximum number of retries while attempting to remediate an unhealthy machine.
// A retry happens when a machine that was created as a replacement for an unhealthy machine also fails.
// For example, given a control plane with three machines M1, M2, M3:
//
// M1 becomes unhealthy; remediation happens, and M1-1 is created as a replacement.
// If M1-1 (replacement of M1) has problems while bootstrapping it will become unhealthy, and then be
// remediated; such an operation is considered a retry, remediation-retry #1.
// If M1-2 (replacement of M1-1) becomes unhealthy, remediation-retry #2 will happen, etc.
//
// A retry can happen only after RetryPeriod has elapsed since the previous retry.
// If a machine is marked as unhealthy after MinHealthyPeriod has elapsed since the previous remediation,
// this is no longer considered a retry because the new issue is assumed to be unrelated to the previous one.
//
// If not set, the remediation will be retried infinitely.
// +optional
MaxRetry *int32 `json:"maxRetry,omitempty"`

// retryPeriod is the duration that RKE2ControlPlane should wait before remediating a machine being created as a replacement
// for an unhealthy machine (a retry).
//
// If not set, a retry will happen immediately.
// +optional
RetryPeriod metav1.Duration `json:"retryPeriod,omitempty"`

// minHealthyPeriod defines the duration after which RKE2ControlPlane will consider any failure of a machine unrelated
// to the previous one. In this case the remediation is no longer considered a retry, and thus the retry
// counter restarts from 0. For example, assuming MinHealthyPeriod is set to 1h (default):
//
// M1 becomes unhealthy; remediation happens, and M1-1 is created as a replacement.
// If M1-1 (replacement of M1) has problems within 1h of its creation, this machine
// will also be remediated, and this operation is considered a retry (a problem related
// to the original issue that happened to M1).
//
// If instead the problem on M1-1 happens after MinHealthyPeriod has expired, e.g. four days after
// M1-1 was created as a remediation of M1, the problem on M1-1 is considered unrelated to
// the original issue that happened to M1.
//
// If not set, this value defaults to 1h.
// +optional
MinHealthyPeriod *metav1.Duration `json:"minHealthyPeriod,omitempty"`
}
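
The interaction between `maxRetry`, `retryPeriod`, and `minHealthyPeriod` described in the comments above can be sketched as a small decision function. This is an illustration under stated assumptions, not the controller's code: the `strategy` and `last` types, the `decide` function, and the order of the checks are hypothetical, plain `time.Duration` replaces `metav1.Duration`, and the sketch assumes the unhealthy machine is the replacement created by the previous remediation.

```go
package main

import (
	"fmt"
	"time"
)

// strategy mirrors the three RemediationStrategy fields with plain types (hypothetical, simplified).
type strategy struct {
	MaxRetry         *int32         // nil means retry without limit
	RetryPeriod      time.Duration  // zero means a retry may happen immediately
	MinHealthyPeriod *time.Duration // nil means use the 1h default
}

// last mirrors the parts of LastRemediationStatus needed here (hypothetical, simplified).
type last struct {
	Timestamp  time.Time
	RetryCount int
}

// defaultMinHealthyPeriod matches the 1h default documented above.
const defaultMinHealthyPeriod = time.Hour

// decide reports whether remediation may proceed at time now, and the retry count to record.
func decide(s strategy, prev *last, now time.Time) (bool, int, string) {
	if prev == nil {
		return true, 0, "first remediation"
	}
	minHealthy := defaultMinHealthyPeriod
	if s.MinHealthyPeriod != nil {
		minHealthy = *s.MinHealthyPeriod
	}
	elapsed := now.Sub(prev.Timestamp)
	if elapsed >= minHealthy {
		// The failure is treated as unrelated, so the retry counter restarts from 0.
		return true, 0, "unrelated failure"
	}
	if s.MaxRetry != nil && prev.RetryCount >= int(*s.MaxRetry) {
		return false, prev.RetryCount, "max retries reached"
	}
	if elapsed < s.RetryPeriod {
		return false, prev.RetryCount, "retry period not elapsed"
	}
	return true, prev.RetryCount + 1, "retry"
}

// scenarios runs a few fixed cases and formats each result as "allowed/count/reason".
func scenarios() []string {
	t0 := time.Date(2025, 5, 8, 12, 0, 0, 0, time.UTC)
	maxRetry := int32(2)
	s := strategy{MaxRetry: &maxRetry, RetryPeriod: 5 * time.Minute}
	cases := []struct {
		prev *last
		now  time.Time
	}{
		{nil, t0},
		{&last{Timestamp: t0, RetryCount: 0}, t0.Add(2 * time.Minute)},
		{&last{Timestamp: t0, RetryCount: 0}, t0.Add(10 * time.Minute)},
		{&last{Timestamp: t0, RetryCount: 2}, t0.Add(10 * time.Minute)},
		{&last{Timestamp: t0, RetryCount: 2}, t0.Add(2 * time.Hour)},
	}
	out := make([]string, 0, len(cases))
	for _, c := range cases {
		allowed, count, reason := decide(s, c.prev, c.now)
		out = append(out, fmt.Sprintf("%v/%d/%s", allowed, count, reason))
	}
	return out
}

func main() {
	for _, line := range scenarios() {
		fmt.Println(line)
	}
}
```

The last scenario shows why `minHealthyPeriod` matters: even with the retry budget exhausted, a failure two hours after the previous remediation is treated as a fresh problem and remediated with the counter back at 0.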

// LastRemediationStatus stores info about last remediation performed.
// NOTE: if for any reason information about the last remediation is lost, RetryCount is going to restart from 0 and thus
// more remediations than expected might happen.
type LastRemediationStatus struct {
// machine is the machine name of the latest machine being remediated.
// +required
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=253
Machine string `json:"machine"`

// timestamp is when last remediation happened. It is represented in RFC3339 form and is in UTC.
// +required
Timestamp metav1.Time `json:"timestamp"`

// retryCount is used to keep track of remediation retries for the last remediated machine.
// A retry happens when a machine that was created as a replacement for an unhealthy machine also fails.
// +required
RetryCount int `json:"retryCount"`
}
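
The NOTE on `LastRemediationStatus` and the two annotation constants both warn that losing the stored data resets the retry counter. The sketch below illustrates that effect, assuming the in-progress state is serialized as JSON into the `remediation-in-progress` annotation. The diff does not show how the controller stores it, so the `remediationData` type and both helper functions are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Same key as RemediationInProgressAnnotation in the diff above.
const remediationInProgressAnnotation = "controlplane.cluster.x-k8s.io/remediation-in-progress"

// remediationData mirrors the LastRemediationStatus fields (hypothetical storage format).
type remediationData struct {
	Machine    string    `json:"machine"`
	Timestamp  time.Time `json:"timestamp"`
	RetryCount int       `json:"retryCount"`
}

// markInProgress stores the remediation data as JSON in the annotation map.
func markInProgress(annotations map[string]string, d remediationData) error {
	raw, err := json.Marshal(d)
	if err != nil {
		return err
	}
	annotations[remediationInProgressAnnotation] = string(raw)
	return nil
}

// readInProgress returns the stored data, or a zero value and false when the annotation is absent.
func readInProgress(annotations map[string]string) (remediationData, bool, error) {
	raw, ok := annotations[remediationInProgressAnnotation]
	if !ok {
		return remediationData{}, false, nil
	}
	var d remediationData
	if err := json.Unmarshal([]byte(raw), &d); err != nil {
		return remediationData{}, false, err
	}
	return d, true, nil
}

// demo returns the stored JSON, the retry count read back, and the count seen after the annotation is removed.
func demo() (string, int, int) {
	annotations := map[string]string{}
	d := remediationData{
		Machine:    "m1-1",
		Timestamp:  time.Date(2025, 5, 8, 12, 0, 0, 0, time.UTC),
		RetryCount: 1,
	}
	if err := markInProgress(annotations, d); err != nil {
		return "", -1, -1
	}
	raw := annotations[remediationInProgressAnnotation]
	stored, _, err := readInProgress(annotations)
	if err != nil {
		return "", -1, -1
	}
	// Simulate something external removing the annotation.
	delete(annotations, remediationInProgressAnnotation)
	lost, _, err := readInProgress(annotations)
	if err != nil {
		return "", -1, -1
	}
	return raw, stored.RetryCount, lost.RetryCount
}

func main() {
	raw, stored, lost := demo()
	fmt.Println(raw)
	fmt.Println(stored, lost)
}
```

Once the annotation is gone, the reader falls back to a zero value, so the retry count starts again from 0 and the `maxRetry` limit no longer reflects the attempts already made.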

// RolloutStrategy describes how to replace existing machines
// with new ones.
type RolloutStrategy struct {
52 changes: 52 additions & 0 deletions controlplane/api/v1beta1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

@@ -1952,6 +1952,54 @@ spec:
- control-plane-endpoint
- ""
type: string
remediationStrategy:
description: remediationStrategy is the RemediationStrategy that controls
how control plane machine remediation happens.
properties:
maxRetry:
description: "maxRetry is the maximum number of retries while attempting
to remediate an unhealthy machine.\nA retry happens when a machine
that was created as a replacement for an unhealthy machine also
fails.\nFor example, given a control plane with three machines
M1, M2, M3:\n\n\tM1 becomes unhealthy; remediation happens, and
M1-1 is created as a replacement.\n\tIf M1-1 (replacement of
M1) has problems while bootstrapping it will become unhealthy,
and then be\n\tremediated; such an operation is considered a retry,
remediation-retry #1.\n\tIf M1-2 (replacement of M1-1) becomes
unhealthy, remediation-retry #2 will happen, etc.\n\nA retry
can happen only after RetryPeriod has elapsed since the previous
retry.\nIf a machine is marked as unhealthy after MinHealthyPeriod
has elapsed since the previous remediation,\nthis is no longer
considered a retry because the new issue is assumed to be unrelated
to the previous one.\n\nIf not set, the remediation will be retried
infinitely."
format: int32
type: integer
minHealthyPeriod:
description: "minHealthyPeriod defines the duration after which
RKE2ControlPlane will consider any failure of a machine unrelated\nto
the previous one. In this case the remediation is no longer considered
a retry, and thus the retry\ncounter restarts from 0.
For example, assuming MinHealthyPeriod is set to 1h (default):\n\n\tM1
becomes unhealthy; remediation happens, and M1-1 is created as
a replacement.\n\tIf M1-1 (replacement of M1) has problems within
1h of its creation, this machine\n\twill also be remediated,
and this operation is considered a retry (a problem related\n\tto
the original issue that happened to M1).\n\n\tIf instead the problem
on M1-1 happens after MinHealthyPeriod has expired, e.g. four
days after\n\tM1-1 was created as a remediation of M1,
the problem on M1-1 is considered unrelated to\n\tthe original
issue that happened to M1.\n\nIf not set, this value defaults
to 1h."
type: string
retryPeriod:
description: |-
retryPeriod is the duration that RKE2ControlPlane should wait before remediating a machine being created as a replacement
for an unhealthy machine (a retry).

If not set, a retry will happen immediately.
type: string
type: object
replicas:
description: Replicas is the number of replicas for the Control Plane.
format: int32
@@ -2525,6 +2573,30 @@ spec:
description: Initialized indicates the target cluster has completed
initialization.
type: boolean
lastRemediation:
description: lastRemediation stores info about last remediation performed.
properties:
machine:
description: machine is the machine name of the latest machine
being remediated.
maxLength: 253
minLength: 1
type: string
retryCount:
description: |-
retryCount is used to keep track of remediation retries for the last remediated machine.
A retry happens when a machine that was created as a replacement for an unhealthy machine also fails.
type: integer
timestamp:
description: timestamp is when last remediation happened. It is
represented in RFC3339 form and is in UTC.
format: date-time
type: string
required:
- machine
- retryCount
- timestamp
type: object
observedGeneration:
description: ObservedGeneration is the latest generation observed
by the controller.