Skip to content

OCPBUGS-56220: Ensure the build controller restarts on upgrade #5100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 18, 2025

Conversation

umohnani8
Copy link
Contributor

Closes https://issues.redhat.com/browse/OCPBUGS-56220

- What I did
Fix the reconcilation of the build controller to
ensure that new changes are actually applied when
there is an update to the deployment.

- How to verify it
Upgrade from 4.18 to 4.19, ensure that the upgrade is successful and the machine-os-builder pod is started with the new image.

- Description for the changelog
Ensure the build controller restarts when updates are detected.

@openshift-ci-robot
Copy link
Contributor

@umohnani8: No Jira issue with key OSPBUGS-56220 exists in the tracker at https://issues.redhat.com/.
Once a valid jira issue is referenced in the title of this pull request, request a refresh with /jira refresh.

In response to this:

Closes https://issues.redhat.com/browse/OCPBUGS-56220

- What I did
Fix the reconcilation of the build controller to
ensure that new changes are actually applied when
there is an update to the deployment.

- How to verify it
Upgrade from 4.18 to 4.19, ensure that the upgrade is successful and the machine-os-builder pod is started with the new image.

- Description for the changelog
Ensure the build controller restarts when updates are detected.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@umohnani8 umohnani8 changed the title OSPBUGS-56220: Ensure the build controller restarts on upgrade OCPBUGS-56220: Ensure the build controller restarts on upgrade Jun 2, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jun 2, 2025
@umohnani8
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 2, 2025
@openshift-ci-robot
Copy link
Contributor

@umohnani8: This pull request references Jira Issue OCPBUGS-56220, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Closes https://issues.redhat.com/browse/OCPBUGS-56220

- What I did
Fix the reconcilation of the build controller to
ensure that new changes are actually applied when
there is an update to the deployment.

- How to verify it
Upgrade from 4.18 to 4.19, ensure that the upgrade is successful and the machine-os-builder pod is started with the new image.

- Description for the changelog
Ensure the build controller restarts when updates are detected.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Contributor

@umohnani8: This pull request references Jira Issue OCPBUGS-56220, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2025
@umohnani8
Copy link
Contributor Author

@cheesesashimi @djoshy @yuqi-zhang PTAL

Copy link
Contributor

@djoshy djoshy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good, just a couple of clarifying questions!

// The default values from k8s seem to be set after the resource is created, so set those in the new spec
// we are comparing with to ensure we don't get false positives
for i := range required.Spec.Containers {
ctr := &required.Spec.Containers[i]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this is the key piece, ensuring that the container image is up to date?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to ensure we don't keep updating because the existing and required config don't match. During testing, I noticed that even though nothing was changing with the config, we would always end up in updating with ApplyDeployment, which does a DeepEqual to confirm whether something changed or not before updating. Looking through the differences of existing and required, I saw that we were using default values for the TerminationMessagePath and ImagePullPolicy, which get set after the resource is created so existing had it but required did not causing the DeepEqual to not return true.
This is an attempt to populate those default fields before hand so we are not updating constantly because of that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, understood!

}
klog.Infof("Starting Machine OS Builder pod because MachineConfigPool(s) opted into layering")
return optr.startMachineOSBuilderDeployment(mob, layeredMCPs)
if len(layeredMCPs) != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an if block at the end of the function with len(layeredMCPs) != 0 , which won't be called with the return enclosed here. Perhaps we should move that into this block as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, fixed!

Copy link
Contributor

@djoshy djoshy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. I'll defer to @cheesesashimi for final tags since he wrote the original deployment code.

@umohnani8
Copy link
Contributor Author

/retest

3 similar comments
@umohnani8
Copy link
Contributor Author

/retest

@umohnani8
Copy link
Contributor Author

/retest

@umohnani8
Copy link
Contributor Author

/retest

Fix the reconcilation of the build controller to
ensure that new changes are actually applied when
there is an update to the deployment.

Signed-off-by: Urvashi <[email protected]>
@umohnani8
Copy link
Contributor Author

/retest

1 similar comment
@umohnani8
Copy link
Contributor Author

/retest

@umohnani8
Copy link
Contributor Author

/test e2e-gcp-op-ocl

@pablintino
Copy link
Contributor

It looks good to me. I appreciate the comments you added in the changes to the ensurePodTemplateSpec function.
I let others, with more experience in this piece of code, to approve it.

@cheesesashimi
Copy link
Member

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2025
Copy link
Contributor

openshift-ci bot commented Jun 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, djoshy, umohnani8

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cheesesashimi,djoshy,umohnani8]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 4e6c150 and 2 for PR HEAD 3b58bc2 in total

1 similar comment
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 4e6c150 and 2 for PR HEAD 3b58bc2 in total

@ptalgulk01
Copy link

Pre-merge verification:
Verified using IPI based AWS cluster on 4.19.

  • Applied the MOSC
  • Check the MOSB was successful
  • Before upgrade check the image of builder pod and other mco pod are same
$ oc debug node/ip-10-0-30-115.us-east-2.compute.internal -- chroot /host rpm-ostree status
Starting pod/ip-10-0-30-115us-east-2computeinternal-debug-tb49x ...
To use host binaries, run `chroot /host`
State: idle
Deployments:
* ostree-unverified-registry:image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image@sha256:157a65922e5b4ab3c8d982158db6558c1d4ea9b06e279118ee634c9bb1d0d793
                   Digest: sha256:157a65922e5b4ab3c8d982158db6558c1d4ea9b06e279118ee634c9bb1d0d793
                  Version: 9.6.20250610-1 (2025-06-13T08:09:42Z)

Removing debug pod ...

$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-os-builder -o jsonpath='{.items[*].spec.containers[*].image}'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b37620d68a01ec5c3d173c57b001d6f8eafb498514513380c7704e3362017587

$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-server -o jsonpath='{.items[*].spec.containers[*].image}'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b37620d68a01ec5c3d173c57b001d6f8eafb498514513380c7704e3362017587 

$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-controller -o jsonpath='{.items[*].spec.containers[*].image}'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b37620d68a01ec5c3d173c57b001d6f8eafb498514513380c7704e3362017587 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7ea177ff2a08e6856c4433a2a95a4ea020b04fa4669835df168a5e6caa10002

$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-operator -o jsonpath='{.items[*].spec.containers[*].image}'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b37620d68a01ec5c3d173c57b001d6f8eafb498514513380c7704e3362017587 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a7ea177ff2a08e6856c4433a2a95a4ea020b04fa4669835df168a5e6caa10002
  • Upgraded the cluster to 4.20 with this fix. Check
  • New MOSB is triggred
$ oc get machineosbuilds
NAME                                        PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED   AGE
worker-4-3aa25515fe8dbd45ba006975d700949d   False      False      True        False         False    132m
worker-4-cfcca78b30d6008f67c0c8c794bd8833   False      False      True        False         False    47m

$ oc debug node/ip-10-0-25-45.us-east-2.compute.internal -- chroot /host rpm-ostree status
Starting pod/ip-10-0-25-45us-east-2computeinternal-debug-c78zq ...
To use host binaries, run `chroot /host`
State: idle
Deployments:
* ostree-unverified-registry:image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image@sha256:75c02a4ac885d477004ad7ff426344846c398356cff67cd734661fe625cf354e
                   Digest: sha256:75c02a4ac885d477004ad7ff426344846c398356cff67cd734661fe625cf354e
                  Version: 9.6.20250610-1 (2025-06-13T16:04:22Z)

Removing debug pod ...

  • Check builder pod is on new image
$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-os-builder -o jsonpath='{.items[*].spec.containers[*].image}'
registry.build10.ci.openshift.org/ci-ln-i258mtb/stable@sha256:fb03c3a6401c04ace722cfa857de48622dc76503653d7e8100989b69d98ff02c

Just a small doubt regarding this error seen in os-machine-builder pod and saw this mentioned in ticket too is this related to the upgrade ?

E0613 16:04:46.772498       1 wrappedqueue.go:250] "Unhandled Error" err="Updating MachineOSBuild \"worker-4-cfcca78b30d6008f67c0c8c794bd8833\" failed: could not update MachineOSBuild: imagebuilder for MachineOSBuild \"worker-4-cfcca78b30d6008f67c0c8c794bd8833\" encountered an error: could not clean up build job objects: could not delete ephemeral configmap etc-policy-worker-4-cfcca78b30d6008f67c0c8c794bd8833: configmaps \"etc-policy-worker-4-cfcca78b30d6008f67c0c8c794bd8833\" not found"

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD b3377c9 and 1 for PR HEAD 3b58bc2 in total

@umohnani8
Copy link
Contributor Author

/cherry-pick release-4.19

@openshift-cherrypick-robot

@umohnani8: once the present PR merges, I will cherry-pick it on top of release-4.19 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@umohnani8
Copy link
Contributor Author

/retest

1 similar comment
@umohnani8
Copy link
Contributor Author

/retest

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD b3377c9 and 2 for PR HEAD 3b58bc2 in total

@umohnani8
Copy link
Contributor Author

/retest-required

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD b3377c9 and 2 for PR HEAD 3b58bc2 in total

1 similar comment
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD b3377c9 and 2 for PR HEAD 3b58bc2 in total

@yuqi-zhang
Copy link
Contributor

/override ci/prow/e2e-gcp-op-single-node

Passed previous and is otherwise failing on CI issues, overriding

Copy link
Contributor

openshift-ci bot commented Jun 18, 2025

@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-gcp-op-single-node

In response to this:

/override ci/prow/e2e-gcp-op-single-node

Passed previous and is otherwise failing on CI issues, overriding

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@yuqi-zhang
Copy link
Contributor

/override ci/prow/e2e-gcp-op

Same reason as above

Copy link
Contributor

openshift-ci bot commented Jun 18, 2025

@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-gcp-op

In response to this:

/override ci/prow/e2e-gcp-op

Same reason as above

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

openshift-ci bot commented Jun 18, 2025

@umohnani8: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit a12b62b into openshift:main Jun 18, 2025
19 checks passed
@openshift-ci-robot
Copy link
Contributor

@umohnani8: Jira Issue OCPBUGS-56220: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-56220 has been moved to the MODIFIED state.

In response to this:

Closes https://issues.redhat.com/browse/OCPBUGS-56220

- What I did
Fix the reconcilation of the build controller to
ensure that new changes are actually applied when
there is an update to the deployment.

- How to verify it
Upgrade from 4.18 to 4.19, ensure that the upgrade is successful and the machine-os-builder pod is started with the new image.

- Description for the changelog
Ensure the build controller restarts when updates are detected.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@umohnani8: new pull request created: #5127

In response to this:

/cherry-pick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.20.0-202506182313.p0.ga12b62b.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants