Skip to content

OCPBUGS-56008: default Azure to create VM user-assigned identities #9718

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

patrickdillon
Copy link
Contributor

@patrickdillon patrickdillon commented May 15, 2025

#9538 changed the installer's default behavior so that it no longer created user-assigned identities and attached them to VMs for Azure installs. OCPBUGS-56008 revealed that the acr-credentials-provider still depends on the node identity and the missing identities breaks pulling images from private registries in Azure.

Simply reverting #9538 would cause additional issues, particularly:

This PR resolves these issues and the bug by keeping the install config API in place, but changing the default back so the installer continues to create and attach identities by default. ARO & internal Red Hat developers (assuming they don't need ACR) can set installConfig.platform.azure.defaultMachinePool.identity.type: None so that no identity is created.

Upstream is working on a solution that will use credentials requests. Once that is landed, we can set the default identity type back to None and remove the code to create identities. Upstream tracker:
kubernetes/enhancements#4412

@openshift-ci-robot
Copy link
Contributor

@patrickdillon: No Jira issue with key OCPGBUGS-56008 exists in the tracker at https://issues.redhat.com/.
Once a valid jira issue is referenced in the title of this pull request, request a refresh with /jira refresh.

In response to this:

Removing the identity from the Azure nodes in
#9538
broke the acr-credentials-provider, which allows pulling images from private registries in Azure.

Instead of re-adding the identity attached to the node, we can add a secret which will add credentials to the cloud provider config.

Upstream is working on a solution that will use credentials requests. Once that is landed, we can remove this. Track:
kubernetes/enhancements#4412

/wip

Opening this for testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@patrickdillon patrickdillon changed the title OCPGBUGS-56008: creds for Azure image pull OCPBUGS-56008: creds for Azure image pull May 15, 2025
@openshift-ci openshift-ci bot requested review from barbacbd and jhixson74 May 15, 2025 20:01
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 15, 2025
@openshift-ci-robot
Copy link
Contributor

@patrickdillon: This pull request references Jira Issue OCPBUGS-56008, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Removing the identity from the Azure nodes in
#9538
broke the acr-credentials-provider, which allows pulling images from private registries in Azure.

Instead of re-adding the identity attached to the node, we can add a secret which will add credentials to the cloud provider config.

Upstream is working on a solution that will use credentials requests. Once that is landed, we can remove this. Track:
kubernetes/enhancements#4412

/wip

Opening this for testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@patrickdillon
Copy link
Contributor Author

/hold

Not sure if this is what we want to do. I think so, but wanted to open it quickly for testing.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 15, 2025
@patrickdillon patrickdillon force-pushed the OCPBUGS-56008-azure-image-pull branch from 11fc843 to c032e69 Compare May 19, 2025 19:59
@patrickdillon patrickdillon changed the title OCPBUGS-56008: creds for Azure image pull OCPBUGS-56008: default Azure to create user-assigned identity for node May 19, 2025
@patrickdillon patrickdillon changed the title OCPBUGS-56008: default Azure to create user-assigned identity for node OCPBUGS-56008: default Azure to create VM user-assigned identities May 19, 2025
@patrickdillon
Copy link
Contributor Author

/hold cancel
/jira refresh

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 19, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 19, 2025
@openshift-ci-robot
Copy link
Contributor

@patrickdillon: This pull request references Jira Issue OCPBUGS-56008, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

/hold cancel
/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from sunzhaohua2 May 19, 2025 20:21
@patrickdillon patrickdillon force-pushed the OCPBUGS-56008-azure-image-pull branch from c032e69 to 67e25c3 Compare May 19, 2025 20:24
@shellyyang1989
Copy link
Contributor

looking at the job, is it a regression?

level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: error creating user-assigned identity: please ensure your user credentials have the User Access Admin Role or if you are not utilizing an Azure Container Registry you can set installconfig.platform.azure.defaultMachinePlatform.identity: None to skip the creation of the identity: creation failed with: failed to create user assigned identity ci-op-pbxggpj7-576ff-q9wsg-identity: PUT https://management.azure.com/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-pbxggpj7-576ff-q9wsg-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/ci-op-pbxggpj7-576ff-q9wsg-identity
level=error msg=--------------------------------------------------------------------------------
level=error msg=RESPONSE 403: 403 Forbidden
level=error msg=ERROR CODE: AuthorizationFailed
level=error msg=--------------------------------------------------------------------------------
level=error msg={
level=error msg=  "error": {
level=error msg=    "code": "AuthorizationFailed",
level=error msg=    "message": "The client '9bac92c7-5636-440a-b050-2a61af171cfd' with object id '9bac92c7-5636-440a-b050-2a61af171cfd' does not have authorization to perform action 'Microsoft.ManagedIdentity/userAssignedIdentities/write' over scope '/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-pbxggpj7-576ff-q9wsg-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/ci-op-pbxggpj7-576ff-q9wsg-identity' or the scope is invalid. If access was recently granted, please refresh your credentials."
level=error msg=  }
level=error msg=}
level=error msg=--------------------------------------------------------------------------------
level=error

@theobarberbany
Copy link
Contributor

@shellyyang1989 I'd guess that it's a permissions issue on the install for all of the CI Jobs :) QE testing IIUC shows this fixes the issue :)

For context, the reason the installer team wanted to change this behaviour was to remove the User Access Admin Role from the install process, which means they'd no longer create a user identity at install time (at the request of a large customer, and to generally improve our security positioning). I don't think they realised at the time that the credential providers were relying on this identity!

func defaultAzureMachinePoolPlatform() azuretypes.MachinePool {
func defaultAzureMachinePoolPlatform(env azuretypes.CloudEnvironment) azuretypes.MachinePool {
idType := capz.VMIdentityUserAssigned
if env == azuretypes.StackCloud {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we special casing stackcloud? won't this then break stackcloud? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Azure Stack has never used node-attached identities, so I think we're safe here. TBH, I'm not certain why we never attached them, but the current state of this PR would continue the existing behavior we already have.

roleAssignmentsClient := clientFactory.NewRoleAssignmentsClient()
roleAssignmentUUID := uuid.New().String()

// XXX: Azure doesn't like creating an identity and immediately
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we love azure </3

Copy link
Contributor

@theobarberbany theobarberbany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one question but otherwise LGTM :)

}

func createUserAssignedIdentity(ctx context.Context, in identityInput) error {
userAssignedIdentityName := fmt.Sprintf("%s-identity", in.infraID)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the code in this function is existing code that is being restored from c26d2bb

@patrickdillon patrickdillon force-pushed the OCPBUGS-56008-azure-image-pull branch from 67e25c3 to 49ab305 Compare May 20, 2025 17:50
@patrickdillon
Copy link
Contributor Author

fixed up some nil pointer exceptions in the utils.go logic and updated some doc text.

openshift#9538 switched the installer
to not create user-assigned identities for VMs, and exposed an API
for users to bring-their-own identities and attach them to nodes.

OCPBUGS-56008 shows that the kubelet still depends on the node
identity to pull images from Azure Container Registry (ACR). To
resolve this issue, this commit sets the default back to using
an installer-generated identity attached to the node. The API is
still exposed in the install config, so users who do not utilize
ACR can set the identity type to None and install with less privileged
credentials.

When upstream work lands to allow these credentials to be managed
via credentialsrequests, we can go set the default identity to None
and remove the logic for creating identities. The upstream work
is tracked here and looks like it should be available in the next
release:

kubernetes/enhancements#4412
This restores code removed in c26d2bb
for creating a user-assigned identity to attach to nodes.

The code has been moved to its own file and is wrapped in a conditional
so that the identity is only created when needed.
This commit conditionally sets fields in the cloud provider config
to indicate that the the VM identity should be used to authenticate
the ACR credential provider. If the installer is creating the identity
then we set the cloud config to use that attached identity. This has
been the behavior for openshif in all previous releases.
This revendors the azure packages that are needed to create a
user-assigned identity.
@patrickdillon patrickdillon force-pushed the OCPBUGS-56008-azure-image-pull branch from 49ab305 to 8e54931 Compare May 20, 2025 18:21
@jhixson74
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 20, 2025
@jhixson74
Copy link
Member

/approve

Copy link
Contributor

openshift-ci bot commented May 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhixson74

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 20, 2025
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 94c2b43 and 2 for PR HEAD 8e54931 in total

Copy link
Contributor

openshift-ci bot commented May 21, 2025

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn-multi-network 8e54931 link false /test e2e-vsphere-ovn-multi-network
ci/prow/e2e-azurestack 8e54931 link false /test e2e-azurestack
ci/prow/azure-ovn-marketplace-images 8e54931 link false /test azure-ovn-marketplace-images
ci/prow/e2e-vsphere-externallb-ovn 8e54931 link false /test e2e-vsphere-externallb-ovn
ci/prow/e2e-azure-default-config 8e54931 link false /test e2e-azure-default-config
ci/prow/e2e-azure-ovn-resourcegroup 8e54931 link false /test e2e-azure-ovn-resourcegroup
ci/prow/e2e-azure-ovn-shared-vpc 8e54931 link false /test e2e-azure-ovn-shared-vpc
ci/prow/okd-scos-e2e-aws-ovn 8e54931 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit a697e9b into openshift:main May 21, 2025
15 of 23 checks passed
@openshift-ci-robot
Copy link
Contributor

@patrickdillon: Jira Issue OCPBUGS-56008: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-56008 has been moved to the MODIFIED state.

In response to this:

#9538 changed the installer's default behavior so that it no longer created user-assigned identities and attached them to VMs for Azure installs. OCPBUGS-56008 revealed that the acr-credentials-provider still depends on the node identity and the missing identities breaks pulling images from private registries in Azure.

Simply reverting #9538 would cause additional issues, particularly:

This PR resolves these issues and the bug by keeping the install config API in place, but changing the default back so the installer continues to create and attach identities by default. ARO & internal Red Hat developers (assuming they don't need ACR) can set installConfig.platform.azure.defaultMachinePool.identity.type: None so that no identity is created.

Upstream is working on a solution that will use credentials requests. Once that is landed, we can set the default identity type back to None and remove the code to create identities. Upstream tracker:
kubernetes/enhancements#4412

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-installer
This PR has been included in build ose-installer-container-v4.20.0-202505210746.p0.ga697e9b.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-baremetal-installer
This PR has been included in build ose-baremetal-installer-container-v4.20.0-202505210746.p0.ga697e9b.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-installer-artifacts
This PR has been included in build ose-installer-artifacts-container-v4.20.0-202505210746.p0.ga697e9b.assembly.stream.el9.
All builds following this will include this PR.

@patrickdillon
Copy link
Contributor Author

/cherry-pick release-4.19

@openshift-cherrypick-robot

@patrickdillon: new pull request created: #9731

In response to this:

/cherry-pick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants