Add support for a custom service account for hub templates #236


Merged: 2 commits into open-cluster-management-io:main on Aug 5, 2024

Conversation

@mprahl (Member Author) commented Aug 1, 2024

A new field, spec.hubTemplateOptions.serviceAccountName, can be set to resolve hub templates with a service account in the root policy namespace instead of the default restricted hub templating, which can only reference objects in the root policy namespace.

This leverages the TokenRequest API to get short-lived tokens for the service accounts and periodically refreshes them before expiration.

Relates:
https://issues.redhat.com/browse/ACM-12790
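
For illustration, a minimal client-go sketch of the TokenRequest flow (not the PR's actual code); the namespace, service account name, and expiration below are hypothetical placeholders:

```go
package tokens

import (
	"context"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// requestToken asks the TokenRequest API for a short-lived service account
// token; a caller would re-invoke it before the token expires.
func requestToken(ctx context.Context, client kubernetes.Interface) (string, error) {
	// Placeholder values: the real controller takes the namespace from the
	// root policy and the name from spec.hubTemplateOptions.serviceAccountName.
	namespace := "policies"
	saName := "hub-templates-sa"
	expiration := int64(3600) // seconds

	tokenRequest, err := client.CoreV1().ServiceAccounts(namespace).CreateToken(
		ctx,
		saName,
		&authenticationv1.TokenRequest{
			Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &expiration},
		},
		metav1.CreateOptions{},
	)
	if err != nil {
		return "", err
	}

	return tokenRequest.Status.Token, nil
}
```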

@@ -98,6 +102,9 @@ type PolicySpec struct {
// PolicyDependencies is a list of dependency objects detailed with extra considerations for
// compliance that should be fulfilled before applying the policies to the managed clusters.
Dependencies []PolicyDependency `json:"dependencies,omitempty"`

// HubTemplateOptions changes the default behavior of hub templates.
HubTemplateOptions *HubTemplateOptions `json:"hubTemplateOptions,omitempty"`

mprahl (Member Author):
This is a pointer so that omitempty takes effect rather than serializing an empty struct when hubTemplateOptions is unset.
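
For illustration, a minimal sketch of the serialization difference; the struct shapes are abridged from the diff above, and the program itself is hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type HubTemplateOptions struct {
	ServiceAccountName string `json:"serviceAccountName,omitempty"`
}

type PolicySpec struct {
	// As a struct value, omitempty would never fire: an empty struct is not a
	// JSON "empty" value, so "hubTemplateOptions": {} would always be emitted.
	// As a pointer, a nil field is dropped entirely.
	HubTemplateOptions *HubTemplateOptions `json:"hubTemplateOptions,omitempty"`
}

func main() {
	unset, _ := json.Marshal(PolicySpec{})
	fmt.Println(string(unset)) // {}

	set, _ := json.Marshal(PolicySpec{HubTemplateOptions: &HubTemplateOptions{}})
	fmt.Println(string(set)) // {"hubTemplateOptions":{}}
}
```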

Contributor:
Do we need to create HubTemplateOptions when ServiceAccountName is its only property?

mprahl (Member Author):
We don't need it, but I chose it in case we make other things customizable, and I thought a hubTemplateServiceAccountName field was too long lol.

Contributor:
Grouping is prettier, and it may be good for the future.

if errors.Is(tmplErr, ErrRetryable) {
// Return the error if it's retryable, which will utilize controller-runtime's exponential backoff.
return reconcile.Result{}, tmplErr
} else if errors.Is(tmplErr, ErrSAMissing) {

Contributor:
Is the else not needed?

mprahl (Member Author):
It is needed because template errors that are not retryable must continue so that the replicated policy gets the hub template error annotation set.
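
For context, a sketch of the sentinel-error pattern the snippet above relies on; the error names come from the diff, but the helper and wrapping site are hypothetical:

```go
package templates

import (
	"errors"
	"fmt"
)

// Sentinel errors that the reconciler classifies with errors.Is.
var (
	ErrRetryable = errors.New("retryable template error")
	ErrSAMissing = errors.New("the service account is missing")
)

// resolveWithServiceAccount is a hypothetical wrapping site: failures are
// wrapped with %w so errors.Is still matches the sentinel after wrapping.
func resolveWithServiceAccount(saExists bool) error {
	if !saExists {
		return fmt.Errorf("%w: check spec.hubTemplateOptions.serviceAccountName", ErrSAMissing)
	}

	return nil
}
```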

@mprahl (Member Author) commented Aug 1, 2024

I'm not sure what's going on with the test. It's failing every time in CI but working every time locally...

g.Expect(err).ToNot(HaveOccurred())
g.Expect(refreshedToken).ToNot(BeNil())
g.Expect(bytes.Equal(token, refreshedToken)).To(BeFalse(), "Expected the token to have been refreshed")
}, 70, 3).Should(Succeed())

Contributor:
Why 70, not 60?

mprahl (Member Author):
Just because it could take exactly 60 seconds before the token refreshes, so I gave it a bit of a buffer. I could have made it 65 seconds, though.
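
To make the timing concrete, a tiny hypothetical sketch; the helper and the 0.8 fraction are illustrative only, with just the 60/70-second reasoning taken from the thread:

```go
package tokens

import "time"

// nextRefresh schedules the replacement request at a fraction of the token's
// lifetime. If the refresh lands at exactly the 60-second mark, a 60-second
// test timeout could race it, hence the 70-second buffer in the Eventually.
func nextRefresh(issued time.Time, lifetime time.Duration) time.Time {
	return issued.Add(time.Duration(float64(lifetime) * 0.8))
}
```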

Contributor:
I don't think it's a big deal as long as the token changes. I don't think anyone will complain that it's over 1 minute.

}

// RemovePolicy will clean up watches on the current template resolver (service account used in the last call to
// GetResolver) and if this was the last policy using this template resolver, the template resolver will be cleaned up.

Contributor:
Root policy or replicated policy? It's confusing. Based on the referenceCount description, it's a replicated policy, right?

mprahl (Member Author):
Yup, it's a replicated policy. I can rename the method or clarify the description. Any preference?

Contributor:
RemoveRePolicy? I don't know, Dale is good at this lol

globalLock sync.RWMutex
// The default template resolver when a replicated policy does not specify a service account.
defaultTemplateResolver *templates.TemplateResolver
// A map of service acconts to template resolvers. Access to this must leverage globalLock.

Contributor:
Suggested change:
- // A map of service acconts to template resolvers. Access to this must leverage globalLock.
+ // A map of service acconts to templateResolverWithCancel. Access to this must leverage globalLock.

??

Contributor:
Or maybe "template resolver single"?

@yiraeChristineKim (Contributor):
This is amazing! You are the best expert in Go concurrency

@mprahl (Member Author) commented Aug 2, 2024

/hold for reviews

@@ -408,7 +432,5 @@ func (r *ReplicatedPolicyReconciler) processTemplates(
}
}

log.V(1).Info("Successfully processed templates")

mprahl (Member Author):
This log message is removed because processTemplates is now always called, and when no templates are present, it does nothing. This is so watches get cleaned up if hub templates used to be present but have since been removed.

@JustinKuli (Member) left a comment:
Seems about right 😅

Comment on lines +337 to +351
var returnErr error

// Retry template errors due to permission issues. This isn't ideal, but there's no good event driven way to
// be notified when the permissions are given to the service account.
if k8serrors.IsForbidden(tmplErr) {
returnErr = tmplErr
}

if watcherErr != nil {
log.Info("Requeueing for the dynamic watcher error")

returnErr = watcherErr
}

-	return reconcile.Result{}, watcherErr
+	return reconcile.Result{}, returnErr

JustinKuli (Member):
I wonder if customizing the retry interval for Forbidden errors might be nice. So maybe something like this:

	if watcherErr != nil {
		log.Info("Requeueing for the dynamic watcher error")

		return reconcile.Result{}, watcherErr
	}

	// Retry template errors due to permission issues. This isn't ideal, but there's no good event driven way to
	// be notified when the permissions are given to the service account.
	if k8serrors.IsForbidden(tmplErr) {
		log.Info("Requeueing in 2 minutes for the permission issue")
		
		// returning the error would override the RequeueAfter setting
		return reconcile.Result{RequeueAfter: 2 * time.Minute}, nil
	}

	return reconcile.Result{}, nil
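
(For context on the comment in that sketch: when Reconcile returns a non-nil error, controller-runtime requeues using its rate-limited backoff and ignores any RequeueAfter in the Result, so the fixed interval has to be paired with a nil error.)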

// the template resolver will be cleaned up. A pointer is used since the reference count can be copied to a new
// templateResolverWithCancel instance if the TemplateResolver instance was unexpectedly canceled and was
// restarted.
referenceCount *atomic.Uint32

JustinKuli (Member):
I can't shake the feeling that doing our own reference counting means we're missing some other way to do this.

mprahl (Member Author):
The only other way I could think of that would work, other than periodically checking whether any root policy still references the service account, was a map of service accounts to replicated policies. I thought the reference count was better to avoid locking around a shared map, and it also saves a bit of memory.

Another way that came to mind was a read-write lock for each template resolver: every replicated policy would hold a read lock, and a separate goroutine would wait for the write lock, which would indicate there are no readers left and the template resolver can be cleaned up. The problem is that the read-write lock in the Go standard library doesn't work like that, since it prioritizes granting the write lock when it can, which makes sense.
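
A condensed sketch of the reference-counting shape described in this thread; the type and method names are loosely based on the snippets quoted above, and the details are hypothetical:

```go
package resolvers

import (
	"sync"
	"sync/atomic"
)

type templateResolverWithCancel struct {
	// cancel stops the resolver's goroutines (for example, token refreshing).
	cancel func()
	// referenceCount tracks how many replicated policies use this resolver.
	// A pointer lets the count be carried over if the resolver is restarted.
	referenceCount *atomic.Uint32
}

type resolverMap struct {
	globalLock sync.RWMutex
	// Keyed by service account name; access must hold globalLock.
	resolvers map[string]*templateResolverWithCancel
}

// GetResolver returns the resolver for the service account, creating it on
// first use, and increments its reference count.
func (m *resolverMap) GetResolver(saName string) *templateResolverWithCancel {
	m.globalLock.Lock()
	defer m.globalLock.Unlock()

	resolver, ok := m.resolvers[saName]
	if !ok {
		resolver = &templateResolverWithCancel{
			cancel:         func() {},
			referenceCount: &atomic.Uint32{},
		}
		m.resolvers[saName] = resolver
	}

	resolver.referenceCount.Add(1)

	return resolver
}

// RemovePolicy decrements the reference count and, when this was the last
// replicated policy using the resolver, cancels and deletes it.
func (m *resolverMap) RemovePolicy(saName string) {
	m.globalLock.Lock()
	defer m.globalLock.Unlock()

	resolver, ok := m.resolvers[saName]
	if !ok {
		return
	}

	// Add(^uint32(0)) is the idiomatic atomic decrement for a Uint32.
	if resolver.referenceCount.Add(^uint32(0)) == 0 {
		resolver.cancel()
		delete(m.resolvers, saName)
	}
}
```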

mprahl added 2 commits August 5, 2024 13:28
A new field of spec.hubTemplateOptions.serviceAccountName can be set to
leverage a service account in the root policy namespace to resolve hub
templates instead of the restricted hub templating of referencing only
objects in the root policy namespace.

This leverages the TokenRequest API to get short-lived tokens of the
service accounts and periodically refreshes them before expiration.

Relates:
https://issues.redhat.com/browse/ACM-12790

Signed-off-by: mprahl <[email protected]>
This was covering up an error during test development.

Signed-off-by: mprahl <[email protected]>
@mprahl mprahl requested a review from JustinKuli August 5, 2024 17:29

@JustinKuli (Member) left a comment:
Seems good to me.

openshift-ci bot commented Aug 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JustinKuli, mprahl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mprahl (Member Author) commented Aug 5, 2024

/unhold

openshift-merge-bot merged commit fe30aeb into open-cluster-management-io:main on Aug 5, 2024
13 checks passed