Kubeadm issue #3152 ControlPlane node setup failing with "etcdserver: can only promote a learner member" #130782

Open
wants to merge 1 commit into
base: master

Conversation

@BernardMC BernardMC commented Mar 13, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes an issue where kubeadm tries to promote a learner member that has already been promoted.
This can happen when a previous promote call does not return success even though the promotion actually succeeded on
the backend. The retries then eventually time out and node bring-up fails.
Also adds a call to remove the member we failed to add if member promotion fails entirely.
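
The failure mode described above can be reproduced without a real etcd cluster. The following is a minimal sketch, not the actual kubeadm change: `fakeCluster`, `member`, and `promoteWithCheck` are hypothetical names introduced only for illustration, and the fake's first promote call applies the change but still reports an error, which is an assumption standing in for a response lost after the server committed it.

```go
package main

import (
	"errors"
	"fmt"
)

// member is a simplified, hypothetical view of an etcd cluster member.
type member struct {
	ID        uint64
	IsLearner bool
}

// fakeCluster is a hypothetical in-memory stand-in for an etcd cluster.
// Its first promote call applies the change but still returns an error.
type fakeCluster struct {
	members      []member
	promoteCalls int
}

func (f *fakeCluster) list() []member { return f.members }

func (f *fakeCluster) promote(id uint64) error {
	f.promoteCalls++
	for i := range f.members {
		if f.members[i].ID != id {
			continue
		}
		if !f.members[i].IsLearner {
			// Same message as in the linked issue title.
			return errors.New("etcdserver: can only promote a learner member")
		}
		f.members[i].IsLearner = false
		if f.promoteCalls == 1 {
			// The promotion took effect, but the caller sees a failure.
			return errors.New("context deadline exceeded")
		}
		return nil
	}
	return errors.New("member not found")
}

// promoteWithCheck retries promotion up to attempts times. Before each
// attempt it lists the members and stops early if the target is no
// longer a learner, instead of trying to promote it a second time.
func promoteWithCheck(c *fakeCluster, id uint64, attempts int) error {
	for i := 0; i < attempts; i++ {
		for _, m := range c.list() {
			if m.ID == id && !m.IsLearner {
				return nil // already promoted, nothing left to do
			}
		}
		if err := c.promote(id); err == nil {
			return nil
		}
	}
	return fmt.Errorf("member %x was not promoted after %d attempts", id, attempts)
}

func main() {
	c := &fakeCluster{members: []member{{ID: 1}, {ID: 2, IsLearner: true}}}
	err := promoteWithCheck(c, 2, 5)
	fmt.Println(err, c.promoteCalls) // prints "<nil> 1"
}
```

Without the membership check at the top of the loop, every retry after the first would hit the "can only promote a learner member" error until the attempts ran out.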

Which issue(s) this PR fixes:

kubernetes/kubeadm#3152

Fixes kubernetes/kubeadm#3152

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

kubeadm: fixed issue where etcd member promotion fails with an error saying the member was already promoted

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 13, 2025

linux-foundation-easycla bot commented Mar 13, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2025
@k8s-ci-robot
Contributor

Welcome @BernardMC!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @BernardMC. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 13, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. label Mar 13, 2025
@neolit123
Member

/release-note-edit

kubeadm: fixed issue where etcd member promotion fails with an error saying the member was already promoted

@neolit123
Member

@BernardMC please keep the commits squashed to 1.
note that a PR containing merge commits cannot be merged.

@neolit123
Member

please change this in the pr description

https://github.com/kubernetes/kubeadm/issues/3152

Fixes #

to be

Fixes https://github.com/kubernetes/kubeadm/issues/3152

Member

@neolit123 neolit123 left a comment

/cc @pacoxu
ptal if possible.

@BernardMC please message here once the cla is approved.

@k8s-ci-robot k8s-ci-robot requested a review from pacoxu March 13, 2025 14:58
@@ -586,6 +595,11 @@ func (c *Client) MemberPromote(learnerID uint64) error {
		return false, nil
	})
	if err != nil {
		klog.V(5).Infof("[etcd] Failed to promote the learner %s before timeout. Attempting to remove learner member", strconv.FormatUint(learnerID, 16))
		_, err = cli.MemberRemove(context.Background(), learnerID)
Contributor

I understand that if promoting an etcd learner fails, we should remove the unstable member. But why don't we move the member removal action outside the Client.MemberPromote method, and when the Client.MemberPromote call fails, we invoke Client.RemoveMember to delete the member?

Since Client.RemoveMember contains retry mechanisms, it might better handle member removal failures. For example:

	err = etcdClient.MemberPromote(learnerID)
	if err != nil {
		klog.V(5).Infof("[etcd] Failed to promote the learner %s before timeout. Attempting to remove learner member", strconv.FormatUint(learnerID, 16))
		if _, removeErr := etcdClient.RemoveMember(learnerID); removeErr != nil {
			klog.V(5).Infof("[etcd] Removing the learner %s failed: %v", strconv.FormatUint(learnerID, 16), removeErr)
		}
		return err
	}

If you believe that removing the unstable member after a failed learner promotion is just a best-effort action that does not require retries, please add a comment to clarify this behavior.

Author

Yeah, I think it makes sense to make the remove call an explicit call from outside rather than inside the member promote

	}
	if !isLearner {
		klog.V(1).Infof("[etcd] Member %s already promoted.", strconv.FormatUint(learnerID, 16))
		return false, nil
Member

return true here?

Author

Yup should be true

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/code-generation area/test and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 14, 2025
@BernardMC
Author

Your commit user bconry is not the GitHub user name. This may fail the CLA.

Yeah, my global commit email was @vmware.com and I had not updated it to @broadcom.com.
I have done that now and re-squashed the commits, so hopefully I am just waiting for someone in my org to approve the CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 20, 2025
@BernardMC
Author

@HirazawaUi @pacoxu I got my CLA resolved!

@HirazawaUi
Contributor

Did you miss the comment referenced here?
ref: #130782 (comment)

@HirazawaUi
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 20, 2025
@BernardMC
Author

Did you miss the comment referenced here? ref: #130782 (comment)

I'll do a separate PR for that, since leaving a zombie etcd member isn't unique to this "learner already promoted" issue. I'm also not as familiar with that part of the code yet 😊

Member

@pacoxu pacoxu left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 21, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 7c728138cdf1259d0b6b83c4a92759982bccd768

@pacoxu
Member

pacoxu commented Mar 21, 2025

Since we missed the code freeze for v1.33, should we cherry-pick it after the v1.33.0 release? @neolit123

@HirazawaUi
Contributor

I'll do a separate PR for that, since leaving a zombie etcd member isn't unique to this "learner already promoted" issue. I'm also not as familiar with that part of the code yet 😊

/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BernardMC, SataQiu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 21, 2025
@neolit123
Member

Since we missed the code freeze for v1.33, should we cherry-pick it after the v1.33.0 release? @neolit123

sgtm

@BernardMC
Author

Pending - Not mergeable. Must be in milestone v1.33.
I see this status, so are we waiting for the v1.34 branch to exist and then merging it into that?

@neolit123
Member

since it's not a critical bug we should ideally wait for 1.33.0 to release first. then this pr will auto-merge in 1.34-pre.
at that point you can use the hack/cherry_pick_pull.sh script to send automated cherry picks for branches:

  • release-1.33
  • release-1.32
  • release-1.31
  • release-1.30

versions in support:
https://kubernetes.io/releases/

schedule for 1.33:
https://github.com/kubernetes/sig-release/tree/master/releases/release-1.33

@neolit123
Member

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 25, 2025
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/code-generation area/kubeadm area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

ControlPlane node setup failing with "etcdserver: can only promote a learner member"
7 participants