OCPBUGS-54238: Update CSR status condition appropriately #2674

Conversation

pperiyasamy
Member

@pperiyasamy pperiyasamy commented Mar 31, 2025

When a CSR signing attempt is retried, the signer controller does not update the existing CertificateFailed condition. Instead, it tries to add a second CertificateFailed condition, which leads to a Duplicate value: "Failed" error in the network ClusterOperator (co) status:

 % oc get co network
NAME      VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.19.0-0.ci.test-2025-03-26-015315-ci-ln-g8dqch2-latest   True        False         True       4h      Unable to update csr: CertificateSigningRequest.certificates.k8s.io "ipsec-csr-test-80237" is invalid: status.conditions[1].type: Duplicate value: "Failed"

This PR fixes the problem by updating the existing CertificateFailed condition in place, so that the network status is not updated unnecessarily in this case.
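For readers unfamiliar with the pattern, the difference between appending and set-or-update can be sketched as follows. This is a minimal illustration, not the code from this PR: `Condition` is a local stand-in for `csrv1.CertificateSigningRequestCondition`, and `setCondition`, `countType`, and the `SignerFailure` reason are hypothetical names. The sketch only moves `LastTransitionTime` when the status changes, which is an assumption about the desired behavior.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a local stand-in for csrv1.CertificateSigningRequestCondition
// (hypothetical, trimmed to the fields this sketch needs).
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastUpdateTime     time.Time
	LastTransitionTime time.Time
}

// setCondition updates the entry whose Type matches newCond, or appends
// newCond when no such entry exists, so each Type appears at most once.
func setCondition(conds *[]Condition, newCond Condition) {
	now := time.Now()
	for i := range *conds {
		existing := &(*conds)[i]
		if existing.Type != newCond.Type {
			continue
		}
		// Assumption: the transition time only moves when the status flips.
		if existing.Status != newCond.Status {
			existing.Status = newCond.Status
			existing.LastTransitionTime = now
		}
		existing.Reason = newCond.Reason
		existing.Message = newCond.Message
		existing.LastUpdateTime = now
		return
	}
	newCond.LastUpdateTime = now
	newCond.LastTransitionTime = now
	*conds = append(*conds, newCond)
}

// countType reports how many conditions carry the given Type.
func countType(conds []Condition, t string) int {
	n := 0
	for _, c := range conds {
		if c.Type == t {
			n++
		}
	}
	return n
}

func main() {
	failed := Condition{Type: "Failed", Status: "True", Reason: "SignerFailure", Message: "first attempt"}

	// Appending on every retry leaves two entries of type "Failed",
	// which is the shape the API server rejects as a duplicate.
	appended := []Condition{failed}
	appended = append(appended, failed)
	fmt.Println("append:", countType(appended, "Failed"))

	// Set-or-update rewrites the existing entry in place on retry.
	var updated []Condition
	setCondition(&updated, failed)
	failed.Message = "second attempt"
	setCondition(&updated, failed)
	fmt.Println("set:", countType(updated, "Failed"), updated[0].Message)
}
```

With the append path the slice ends up holding two "Failed" entries, while the set-or-update path keeps a single entry whose message reflects the latest attempt.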

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 31, 2025
@openshift-ci-robot
Contributor

@pperiyasamy: This pull request references Jira Issue OCPBUGS-54238, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The signer controller reflects the CSR approval status in the network operator status during the signing process for every CSR. When a CSR is removed, the network status is not restored to its original state because the CSR is no longer available.

When a CSR signing attempt is retried after a failure, the signer controller does not update the existing CertificateFailed condition. Instead, it tries to add another CertificateFailed condition, which leads to the error below:

% oc get co network
NAME      VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.19.0-0.ci.test-2025-03-26-015315-ci-ln-g8dqch2-latest   True        False         True       4h      Unable to update csr: CertificateSigningRequest.certificates.k8s.io "ipsec-csr-test-80237" is invalid: status.conditions[1].type: Duplicate value: "Failed"

This PR fixes it by updating the existing CertificateFailed condition.

Depends on #2560.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@pperiyasamy
Member Author

/retest

@pperiyasamy
Member Author

/assign @martinkennelly @trozet

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 13, 2025
Contributor

@martinkennelly martinkennelly left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2025
@pperiyasamy pperiyasamy force-pushed the cert-signer-controller-status branch from bd8d812 to 749d72a Compare April 16, 2025 08:06
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 16, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2025
@@ -88,6 +89,8 @@ func (r *ReconcileCSR) Reconcile(ctx context.Context, request reconcile.Request)
 	if err != nil {
 		if apierrors.IsNotFound(err) {
 			// Request object not found, could have been deleted after reconcile request.
+			// restore network status when CSR is deleted.
+			r.status.SetNotDegraded(statusmanager.CertificateSigner)
Contributor

@kyrtapz kyrtapz Apr 17, 2025

What if there is more than one CSR that is failing and one of those gets removed?
Was it always broken like that?

Member Author

Yes @kyrtapz, the status gets updated again when the next signing attempt happens for another CSR.

Contributor

@kyrtapz kyrtapz Apr 17, 2025

I still don't see this as the way to go. What if CSR signing fails for one object, but another CSR expires in the meantime and gets garbage collected? We would remove the error.
To me this changes how CNO behaved until now. There is a wider issue of the status being overwritten for every CSR but this is a different discussion.
Is this change still necessary with the other fix you did?

Contributor

To summarize, I don't think we should change the CNO degraded state based on any particular CSR object, this should be reserved only for internal controller failures.
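The concern raised in this thread can be illustrated with a small sketch. It assumes a status manager that keeps a single degraded flag for the signer controller with no record of which CSR set it; `statusManager` and its methods here are hypothetical stand-ins, not the operator's actual `statusmanager` package.

```go
package main

import "fmt"

// statusManager is a hypothetical stand-in: one degraded flag for the
// certificate signer controller, with no memory of which CSR set it.
type statusManager struct {
	degraded bool
	message  string
}

// SetDegraded records a failure, overwriting whatever was there before.
func (s *statusManager) SetDegraded(msg string) {
	s.degraded = true
	s.message = msg
}

// SetNotDegraded clears the flag regardless of who set it.
func (s *statusManager) SetNotDegraded() {
	s.degraded = false
	s.message = ""
}

func main() {
	s := &statusManager{}

	// Signing fails for csr-a, so the operator reports Degraded.
	s.SetDegraded("signing failed for csr-a")

	// An unrelated csr-b expires and is garbage collected. If the
	// not-found path clears the flag, the csr-a failure is hidden.
	s.SetNotDegraded()

	fmt.Println("degraded after unrelated CSR deletion:", s.degraded)
}
```

Because the flag is shared, clearing it on any single CSR deletion loses the failure reported for a different CSR until that CSR is reconciled again.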

@@ -251,7 +254,7 @@ func signerFailure(r *ReconcileCSR, csr *csrv1.CertificateSigningRequest, reason

 // Update the status conditions on the CSR object
 func updateCSRStatusConditions(r *ReconcileCSR, csr *csrv1.CertificateSigningRequest, reason string, message string) {
-	csr.Status.Conditions = append(csr.Status.Conditions, csrv1.CertificateSigningRequestCondition{
+	setCertificateSigningRequestCondition(&csr.Status.Conditions, csrv1.CertificateSigningRequestCondition{
Contributor

Why not use this?

func SetStatusCondition(conditions *[]metav1.Condition, newCondition metav1.Condition) (changed bool) {

Member Author

This code updates a k8s.io/api/certificates/v1.CertificateSigningRequestCondition object, so the helper for the generic metav1.Condition type cannot be used here.
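The type mismatch behind this answer can be sketched with local stand-ins. `MetaCondition` and `CSRCondition` below are hypothetical, trimmed copies of `metav1.Condition` and `csrv1.CertificateSigningRequestCondition`, and `setMetaCondition` only imitates the parameter shape of the quoted `SetStatusCondition` signature. The real types differ in more fields than shown (for example, the CSR condition carries a `LastUpdateTime` and `metav1.Condition` carries an `ObservedGeneration`).

```go
package main

import "fmt"

// MetaCondition stands in for metav1.Condition (hypothetical, trimmed).
type MetaCondition struct {
	Type               string
	Status             string
	ObservedGeneration int64
	Reason             string
	Message            string
}

// CSRCondition stands in for csrv1.CertificateSigningRequestCondition
// (hypothetical, trimmed).
type CSRCondition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// setMetaCondition imitates the parameter shape of SetStatusCondition:
// it only accepts a pointer to a slice of MetaCondition.
func setMetaCondition(conds *[]MetaCondition, c MetaCondition) (changed bool) {
	for i := range *conds {
		if (*conds)[i].Type == c.Type {
			changed = (*conds)[i] != c
			(*conds)[i] = c
			return changed
		}
	}
	*conds = append(*conds, c)
	return true
}

func main() {
	metaConds := []MetaCondition{}
	fmt.Println(setMetaCondition(&metaConds, MetaCondition{Type: "Failed", Status: "True"}))

	csrConds := []CSRCondition{{Type: "Failed", Status: "True"}}
	// The next line would not compile, because *[]CSRCondition is a
	// different type from *[]MetaCondition:
	//   setMetaCondition(&csrConds, MetaCondition{Type: "Failed"})
	fmt.Println(len(csrConds))
}
```

Since Go does not convert between slices of distinct struct types, a helper written for one condition type needs a dedicated counterpart for the other.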

@kyrtapz
Contributor

kyrtapz commented Apr 17, 2025

@pperiyasamy what was the CSR that caused the issue? We should not retry denied CSRs

 		existingCondition.Status = newCondition.Status
 		existingCondition.LastTransitionTime = metav1.NewTime(time.Now())
 	}
 	existingCondition.Reason = newCondition.Reason
Contributor

Why aren't you updating the LastTransitionTime if the Status doesn't change?

Member Author

sure, updated it.

When a CSR signing attempt is retried, the signer controller does not update the
existing CertificateFailed condition. Instead, it tries to add another
CertificateFailed condition, which leads to a Duplicate value: "Failed" error in
the network co status. Fix it by updating the existing CertificateFailed condition.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
@pperiyasamy pperiyasamy force-pushed the cert-signer-controller-status branch from 749d72a to a33bca4 Compare April 17, 2025 17:14
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 17, 2025
@openshift-ci-robot
Contributor

@pperiyasamy: This pull request references Jira Issue OCPBUGS-54238, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826


@pperiyasamy pperiyasamy changed the title from "OCPBUGS-54238: Reset network status when CSR is deleted" to "OCPBUGS-54238: Update CSR status condition appropriately" Apr 17, 2025
@kyrtapz
Contributor

kyrtapz commented Apr 17, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2025
Contributor

openshift-ci bot commented Apr 17, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kyrtapz, martinkennelly, pperiyasamy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 50405c0 and 2 for PR HEAD a33bca4 in total

@zshi-redhat
Contributor

/retest-required

@ricky-rav
Contributor

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 50405c0 and 2 for PR HEAD a33bca4 in total

@ricky-rav
Contributor

/retest-required

@ricky-rav
Contributor

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b0aaa7d and 1 for PR HEAD a33bca4 in total

@zshi-redhat
Contributor

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b0aaa7d and 2 for PR HEAD a33bca4 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 6268c4e and 1 for PR HEAD a33bca4 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 6268c4e and 2 for PR HEAD a33bca4 in total

@pperiyasamy
Member Author

/retest-required

@jcaamano
Contributor

/override ci/prow/e2e-aws-ovn-windows

perma-failing due to https://issues.redhat.com/browse/WINC-1384

@jcaamano
Contributor

/override ci/prow/e2e-aws-ovn-ipsec-upgrade

Not related to this change:
https://issues.redhat.com/browse/OCPBUGS-55262

Contributor

openshift-ci bot commented Apr 23, 2025

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-aws-ovn-windows

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Apr 23, 2025

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-aws-ovn-ipsec-upgrade

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD bcf7b32 and 1 for PR HEAD a33bca4 in total

@jcaamano
Contributor

/override ci/prow/e2e-aws-ovn-windows
/override ci/prow/e2e-aws-ovn-ipsec-upgrade

Contributor

openshift-ci bot commented Apr 23, 2025

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-aws-ovn-ipsec-upgrade, ci/prow/e2e-aws-ovn-windows

@jcaamano
Contributor

/override ci/prow/e2e-aws-ovn-windows

Contributor

openshift-ci bot commented Apr 23, 2025

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-aws-ovn-windows

@jcaamano
Contributor

/override ci/prow/e2e-aws-ovn-windows

This has passed before, and the last iteration failed during deprovision due to the overall job timeout:

{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2025-04-23T15:29:31Z"}
INFO[2025-04-23T15:29:31Z] Received signal.                              signal=interrupt
...
ERRO[2025-04-23T15:30:55Z] Some steps failed:                           
ERRO[2025-04-23T15:30:55Z] 
  * could not run steps: execution cancelled 
INFO[2025-04-23T15:30:55Z] Reporting job state 'failed' with reason 'executing_graph:interrupted' 
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:264","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process gracefully exited before 1h0m0s grace period","severity":"error","time":"2025-04-23T15:30:55Z"}

Contributor

openshift-ci bot commented Apr 23, 2025

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-aws-ovn-windows

@jcaamano
Contributor

/override ci/prow/e2e-aws-ovn-upgrade

The last override was meant to be this one.

Contributor

openshift-ci bot commented Apr 23, 2025

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-aws-ovn-upgrade

Contributor

openshift-ci bot commented Apr 23, 2025

@pperiyasamy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/4.19-upgrade-from-stable-4.18-e2e-azure-ovn-upgrade | a33bca4 | link | false | /test 4.19-upgrade-from-stable-4.18-e2e-azure-ovn-upgrade |
| ci/prow/e2e-ovn-hybrid-step-registry | a33bca4 | link | false | /test e2e-ovn-hybrid-step-registry |
| ci/prow/e2e-aws-ovn-serial | a33bca4 | link | false | /test e2e-aws-ovn-serial |
| ci/prow/4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-upgrade | a33bca4 | link | false | /test 4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-upgrade |
| ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 | a33bca4 | link | false | /test e2e-vsphere-ovn-dualstack-primaryv6 |
| ci/prow/security | a33bca4 | link | false | /test security |
| ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration | a33bca4 | link | false | /test e2e-aws-ovn-shared-to-local-gateway-mode-migration |
| ci/prow/e2e-aws-hypershift-ovn-kubevirt | a33bca4 | link | false | /test e2e-aws-hypershift-ovn-kubevirt |
| ci/prow/e2e-aws-ovn-local-to-shared-gateway-mode-migration | a33bca4 | link | false | /test e2e-aws-ovn-local-to-shared-gateway-mode-migration |
| ci/prow/okd-scos-e2e-aws-ovn | a33bca4 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade | a33bca4 | link | false | /test 4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade |
| ci/prow/e2e-ovn-step-registry | a33bca4 | link | false | /test e2e-ovn-step-registry |
| ci/prow/e2e-aws-ovn-single-node | a33bca4 | link | false | /test e2e-aws-ovn-single-node |

@jcaamano
Contributor

/override ci/prow/e2e-metal-ipi-ovn-ipv6-ipsec

Has passed before, unrelated, failing on other PRs, tracked in https://issues.redhat.com/browse/OCPBUGS-55280

Contributor

openshift-ci bot commented Apr 23, 2025

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-metal-ipi-ovn-ipv6-ipsec

@openshift-merge-bot openshift-merge-bot bot merged commit e6d5a38 into openshift:master Apr 23, 2025
24 of 37 checks passed
@openshift-ci-robot
Contributor

@pperiyasamy: Jira Issue OCPBUGS-54238: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-54238 has been moved to the MODIFIED state.


@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-network-operator
This PR has been included in build cluster-network-operator-container-v4.19.0-202504232036.p0.ge6d5a38.assembly.stream.el9.
All builds following this will include this PR.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
10 participants