
🌱 clusterctl init: add flag for retrying cert-manager readiness check #12055

Open
wants to merge 3 commits into main

Conversation


@azych azych commented Apr 3, 2025

What this PR does / why we need it:

This introduces a new clusterctl init flag "--retry-cert-manager-readiness-check" that allows retrying the check for an already installed cert-manager; by default, this check is attempted only once before a new cert-manager installation is started.

When enabled, the cert-manager readiness check is retried for the duration specified in the clusterctl config file's cert-manager.timeout entry, or for a default timeout if none is set.
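For illustration, a rough sketch of how such a retry could be bounded by that timeout (not necessarily the exact code in this PR; checkAPIReady here stands in for the existing one-shot readiness check):

```go
package certmanager

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForAPIReadyWithRetry repeats a single-shot readiness check until it
// succeeds or the configured timeout (e.g. cert-manager.timeout) expires.
// checkAPIReady is a stand-in for the existing one-shot probe.
func waitForAPIReadyWithRetry(ctx context.Context, timeout time.Duration, checkAPIReady func(context.Context) error) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			if err := checkAPIReady(ctx); err != nil {
				// Not ready yet, or a transient API server error: keep polling.
				return false, nil
			}
			return true, nil
		})
}
```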

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes #11960

/area clusterctl

azych added 3 commits April 3, 2025 13:59
This introduces a new clusterctl init flag
"--retry-cert-manager-readiness-check" that allows retrying
the check for an already installed cert-manager; by default,
this check is attempted only once before a new cert-manager
installation is started.

When enabled, the cert-manager readiness check is retried
for the duration specified in the clusterctl config file's
cert-manager.timeout entry, or for a default timeout.

See: kubernetes-sigs#11960
@k8s-ci-robot k8s-ci-robot added area/clusterctl Issues or PRs related to clusterctl cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 3, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chrischdi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @azych!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 3, 2025
@k8s-ci-robot
Contributor

Hi @azych. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 3, 2025
@sbueringer
Member

sbueringer commented Apr 4, 2025

I would really prefer that we implement the short-term solution described by @chrischdi in #11960 (comment) instead.

With the current PR, users have to set the flag to get behavior that is not brittle. I would like clusterctl to work correctly out of the box (I don't want to add flags that folks then have to set for something that should just work out of the box).

@azych
Author

azych commented Apr 4, 2025

I would really prefer that we implement the short-term solution described by @chrischdi in #11960 (comment) instead.

With the current PR, users have to set the flag to get behavior that is not brittle. I would like clusterctl to work correctly out of the box (I don't want to add flags that folks then have to set for something that should just work out of the box).

Thanks for looking at this.

If I understood @chrischdi's short-term solution correctly, it would be to just hardcode timeout and interval values so that, instead of doing the check once, it would be retried 3-4 times within the span of a second. Besides only helping with some scenarios (as stated in the original comment), it would also have the disadvantage of running retries even though we passed false for the retry flag to waitForAPIReady.

I am happy to go with this, but maybe this PR isn't far from what the long-term solution should be?
Solving this seems to ultimately come down to which flow makes the most sense:
A) there's an already installed cert-manager; users use a flag to wait for it to become ready; on timeout, clusterctl still tries to install/update cert-manager
B) there's an already installed cert-manager; users use a flag to opt out of clusterctl installing cert-manager; clusterctl waits for it to become ready and simply fails on timeout
C) there could be an already installed cert-manager; clusterctl automatically tries to detect if it's installed and waits for it to become ready, and simply fails on timeout or attempts installation/update (should this be behind a flag?)

This PR currently implements A), and I think it would not require a lot of work if B) was considered the target long-term solution (B and C were both proposed in the original comment). Personally, I can definitely see a strong case for C) and for not requiring additional user input, if the detection step can be done reliably and there is no good reason to give users a choice on this.

@sbueringer
Member

sbueringer commented Apr 9, 2025

Besides only helping with some scenarios (as stated in the original comment)

Which additional scenarios would you like to cover? Our idea was to just retry for a bit to mitigate single API server requests failing. This should not be a "wait until cert-manager comes up"; in general, cert-manager should already be running before clusterctl init is run if it's managed by something else.

it would also have the disadvantage of running retries even though we passed false for the retry flag to waitForAPIReady.

This can be solved by adjusting the function signature.
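For example (hypothetical types, not the actual clusterctl code), the signature could carry the retry behavior explicitly instead of a bool that is sometimes ignored:

```go
package certmanager

import (
	"context"
	"fmt"
	"time"
)

// readinessCheckOptions is a hypothetical options struct replacing the plain
// bool, so callers state explicitly whether and for how long to retry.
type readinessCheckOptions struct {
	retry    bool
	timeout  time.Duration
	interval time.Duration
}

// waitForAPIReady does a one-shot check when opts.retry is false and a
// bounded poll otherwise. check stands in for the existing readiness probe.
func waitForAPIReady(ctx context.Context, opts readinessCheckOptions, check func(context.Context) error) error {
	if !opts.retry {
		return check(ctx)
	}
	deadline := time.Now().Add(opts.timeout)
	for {
		err := check(ctx)
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("cert-manager API not ready after %s: %w", opts.timeout, err)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(opts.interval):
		}
	}
}
```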

A) there's an already installed cert-manager; users use a flag to wait for it to become ready; on timeout, clusterctl still tries to install/update cert-manager

This would mean that if there is already a cert-manager deployed, but it's not coming up, we are just going to overwrite the deployed cert-manager. This would be bad.

B) there's an already installed cert-manager; users use a flag to opt out of clusterctl installing cert-manager; clusterctl waits for it to become ready and simply fails on timeout

Do you mean users would use --retry-cert-manager-readiness-check to opt out of clusterctl installing cert-manager? Or do you mean once we add an additional flag?

To be honest, the more I think about it, the more unintuitive this whole thing is.

What if we do the following:

  • Directly add a flag "--manage-cert-manager" (default: true)
  • Adjust the behavior so that it makes sense based on what this flag is set to:
    • If it's true:
      • check if cert-manager is already deployed by checking if the Certificate & Issuer CRDs are deployed
        • if the CRDs don't exist => deploy cert-manager and wait until ready
    • If it's false:
      • only check if the Certificate & Issuer CRDs are deployed; if not => fail (it's not our job to manage cert-manager)

The current mixup between detecting if we should manage cert-manager and waiting for it to be ready just leads to a few strange edge cases where clusterctl might do the wrong thing.
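A rough sketch of that decision flow (hypothetical helper names, checking CRD existence via the apiextensions API; not an actual clusterctl implementation):

```go
package certmanager

import (
	"context"
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// certManagerCRDsInstalled reports whether the Certificate and Issuer CRDs exist.
func certManagerCRDsInstalled(ctx context.Context, c client.Client) (bool, error) {
	for _, name := range []string{"certificates.cert-manager.io", "issuers.cert-manager.io"} {
		crd := &apiextensionsv1.CustomResourceDefinition{}
		if err := c.Get(ctx, client.ObjectKey{Name: name}, crd); err != nil {
			if apierrors.IsNotFound(err) {
				return false, nil
			}
			return false, err
		}
	}
	return true, nil
}

// ensureCertManager sketches the proposed --manage-cert-manager behavior.
// install stands in for the existing "deploy cert-manager and wait until ready" path.
func ensureCertManager(ctx context.Context, c client.Client, manageCertManager bool, install func(context.Context) error) error {
	installed, err := certManagerCRDsInstalled(ctx, c)
	if err != nil {
		return err
	}
	if manageCertManager {
		if installed {
			// What to do when the CRDs already exist was left open above;
			// assume a no-op here (someone else manages cert-manager).
			return nil
		}
		return install(ctx)
	}
	if !installed {
		return fmt.Errorf("cert-manager CRDs not found and --manage-cert-manager=false: cert-manager must be installed separately")
	}
	return nil
}
```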

@azych
Author

azych commented Apr 10, 2025

Besides only helping with some scenarios (as stated in the original comment)

Which additional scenarios would you like to cover? Our idea was to just retry for a bit to mitigate single API server requests failing. This should not be a "wait until cert-manager comes up"; in general, cert-manager should already be running before clusterctl init is run if it's managed by something else.

I was referring to the scenario mentioned here: #11960 (comment), i.e. when the cert-manager installation hasn't finished yet. But I guess it would be the same if it was just non-responsive for any other reason within the one second that was proposed for the short-term solution.

A) there's an already installed cert-manager; users use a flag to wait for it to become ready; on timeout, clusterctl still tries to install/update cert-manager

This would mean that if there is already a cert-manager deployed, but it's not coming up, we are just going to overwrite the deployed cert-manager. This would be bad.

Just for the record, that is currently how the main branch logic works - maybe this should be treated as a bug rather than a new feature?

B) there's an already installed cert-manager; users use a flag to opt out of clusterctl installing cert-manager; clusterctl waits for it to become ready and simply fails on timeout

Do you mean users would use --retry-cert-manager-readiness-check to opt out of clusterctl installing cert-manager? Or do you mean once we add an additional flag?

Generally, I was trying to understand what the long-term solution/logic should be. With option B) specifically, I was proposing a single flag (not --retry-cert...) for opting out (or in) of cert-manager being installed by clusterctl. As I mentioned in the last paragraph of my comment, I think the code currently in this PR would not require a lot of changes if that option were the long-term solution, including renaming the flag.

What if we do the following:

* Directly add a flag "--manage-cert-manager" (default: true)
* Adjust the behavior so that it makes sense based on what this flag is set to:
  * If it's true:
    * check if cert-manager is already deployed by checking if the Certificate & Issuer CRDs are deployed
      * if the CRDs don't exist => deploy cert-manager and wait until ready

I'm assuming we'd want to fail immediately if the CRD check is successful (i.e. Certificate & Issuer are actually deployed)?

  * If it's false:
    * only check if the Certificate & Issuer CRDs are deployed; if not => fail (it's not our job to manage cert-manager)

Should we not check if the API is actually ready? What about a situation where it isn't ready/responding (this also goes back to #11960 (comment))?
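For what it's worth, one way to probe actual API readiness rather than just CRD presence (illustrative only; names and namespace are placeholders) would be a server-side dry-run create of a Certificate, which only succeeds once the cert-manager API, including its webhook, is able to serve requests:

```go
package certmanager

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// apiActuallyReady issues a server-side dry-run create of a Certificate: the
// object is validated (including by cert-manager's webhook) but never
// persisted, so success indicates the cert-manager API is actually serving.
func apiActuallyReady(ctx context.Context, c client.Client) error {
	cert := &unstructured.Unstructured{}
	cert.SetAPIVersion("cert-manager.io/v1")
	cert.SetKind("Certificate")
	cert.SetName("clusterctl-readiness-probe") // placeholder name
	cert.SetNamespace("cert-manager")          // placeholder namespace
	cert.Object["spec"] = map[string]interface{}{
		"dnsNames":   []interface{}{"readiness.example.invalid"},
		"secretName": "clusterctl-readiness-probe",
		"issuerRef": map[string]interface{}{
			"name": "placeholder-issuer",
			"kind": "Issuer",
		},
	}
	// DryRunAll: the request goes through admission (webhooks) but nothing is stored.
	return c.Create(ctx, cert, client.DryRunAll)
}
```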

Successfully merging this pull request may close these issues:

Make retry for clusterctl check for cert-manager API ready configurable