-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🌱 clusterctl init: add flag for retrying cert-manager readiness check #12055
base: main
Are you sure you want to change the base?
Conversation
This introduces a new clusterctl init flag "--retry-cert-manager-readiness-check" that allows to retry the check for an already installed cert-manager, which by default is only attempted once before a new cert-manager installation is started. When enabled, cert-manager readiness check will be retried for the duration specified in clusterctl config file's cert-manager.timeout entry or for a default timeout. See: kubernetes-sigs#11960
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Welcome @azych! |
Hi @azych. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
I would really prefer if we would implement the short term solution described by @chrischdi here #11960 (comment) instead. With the current PR users have to set the flag to get a behavior that is not brittle. I would like clusterctl to work correctly out of the box (I don't want to add flags that folks then have to set for something that should just work out of the box) |
Thanks for looking at this. If I understood @chrischdi short term solution correctly, it would be to just hardcode timeout and interval values so that instead of doing the check once, it would be retried 3/4 times within a span of a second. Besides only helping with some scenarios (as stated in the original comment) it would also have the disadvantage of running retries even though we passed I am happy to go with this, but maybe this PR isn't far from what the long-term should be? This PR currently implements A) and I think it would not require a lot of work if B) was considered the target long-term solution (B and C were both proposed in the original comment). Personally I can definitely see a strong case for C) and not requiring additional user input if the detection step can be done in a reliable way and there is no good reason behind giving users any choice on this. |
Which additional scenarios would you like to cover? Our idea was to just retry for a bit to mitigate single API server requests failing. This should not be a "wait until cert-manager comes up", cert-manager should already be running in general before clusterctl init is run if it's managed by something else
This can be solved by adjusting the function signature
This would mean if there is already a cert-manager deployed, but it's not coming up we are just going to overwrite the deployed cert-manager. This would be bad.
Do you mean users would To be honest the more I think about it the more unintuitive this whole thing is. What about we do the following:
The current mixup between detecting if we should manage cert-manager and waiting for it to be ready just leads to a few strange edge cases where clusterctl might do the wrong thing |
I was referring to the scenario mentioned here: #11960 (comment) - ie. when the cert-manager installation hasn't finished yet. But I guess it would the same if it was just non responsive for any other reason within one second that was proposed for a short-term solution.
Just for the record, that is currently how the main branch logic works - maybe it should be a bug instead of a new feature?
Generally I was trying to understand what the long-term solution/logic should be. With B) option specifically, I was proposing a single flag (not
I'm assuming we'd want to fail immediately if the CRD check is successful (ie. Certificate & Issuer are actually deployed)?
Should we not check if the API is actually ready? What about a situation when it wouldn't be ready/responding (also goes back to #11960 (comment))? |
What this PR does / why we need it:
This introduces a new clusterctl init flag "--retry-cert-manager-readiness-check" that allows to retry the check for an already installed cert-manager, which by default is only attempted once before a new cert-manager installation is started.
When enabled, cert-manager readiness check will be retried for the duration specified in clusterctl config file's cert-manager.timeout entry or for a default timeout.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #11960
/area clusterctl