Split replicated policy reconciles #130
Conversation
I am worried that the test failure is because the replicated policy reconcile is now happening too fast. From the failing test:
Both of these are from the managed cluster label - the first uses … So I'm not really sure what to do about that. It seems to be happening inconsistently. Previously, I think if this happened, it would get handled by the duplicate reconcile.
Force-pushed from bfe9503 to 9602d12.
FYI, the review has moved to JustinKuli#9.
This commit only moves some code between files, it does not change any content. Signed-off-by: Justin Kulikauskas <[email protected]>
This adds some documentation, as well as renames the functions to be a bit more descriptive. Most code changes are only stylistic or to help them match other similar functions. Signed-off-by: Justin Kulikauskas <[email protected]>
The new Propagator type has (almost) all the methods that were on the PolicyReconciler - the only exception is `Reconcile`, which is on the new RootPolicyReconciler type. This should help organize things to make a new ReplicatedPolicyReconciler type. Signed-off-by: Justin Kulikauskas <[email protected]>
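As a rough illustration of this layout (field names and the package name here are assumptions, not necessarily what the PR contains), the shared struct can simply be embedded by both reconcilers:

```go
// Sketch of the shared-struct layout described above; only the type names
// RootPolicyReconciler, ReplicatedPolicyReconciler, and Propagator come
// from the commit message, the rest is illustrative.
package propagator

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Propagator holds the clients and helper methods shared by both reconcilers.
type Propagator struct {
	client.Client
	Scheme *runtime.Scheme
}

// RootPolicyReconciler embeds Propagator and adds its own Reconcile.
type RootPolicyReconciler struct {
	Propagator
}

func (r *RootPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... handle changes to the root policy ...
	return ctrl.Result{}, nil
}

// ReplicatedPolicyReconciler also embeds Propagator, so both controllers
// reuse the same helper methods without duplicating them.
type ReplicatedPolicyReconciler struct {
	Propagator
}
```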
This new reconciler currently handles when a replicated policy changes unexpectedly, and is triggered by events coming from the root policy reconciler.

This removes the custom concurrency for replicated policy creation and deletion; the concurrency here will be adjustable via the more usual `MaxConcurrentReconciles` controller option.

Because the replicated policy reconciler changes the same resources that trigger its reconciliation, it often reconciles the same resource twice in a row. That needs to be addressed for this to be more performant than the previous implementation.

When running the tests, it became apparent that the encryptionkeys controller triggers reconciliation on the replicated policies by adding an annotation to the root policy. This is another place for possible improvement: it could likely also trigger replicated policy reconciles directly.

Refs:
- https://issues.redhat.com/browse/ACM-7332

Signed-off-by: Justin Kulikauskas <[email protected]>
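Continuing the sketch above, the controller wiring might look roughly like the following. This assumes a controller-runtime version (around v0.14) where `source.Channel` is a struct; the channel and parameter names are made up for illustration.

```go
// Hypothetical wiring for the new replicated-policy controller: the root
// reconciler pushes events onto a channel, and this controller watches it.
package propagator

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	policiesv1 "open-cluster-management.io/governance-policy-propagator/api/v1"
)

func (r *ReplicatedPolicyReconciler) SetupWithManager(
	mgr ctrl.Manager, updates <-chan event.GenericEvent, concurrency int,
) error {
	return ctrl.NewControllerManagedBy(mgr).
		Named("replicated-policy").
		For(&policiesv1.Policy{}).
		// Events emitted by the root policy reconciler (and potentially the
		// encryptionkeys controller) are enqueued through this channel.
		Watches(&source.Channel{Source: updates}, &handler.EnqueueRequestForObject{}).
		// Concurrency becomes the standard controller-runtime option instead
		// of the previous bespoke worker pool.
		WithOptions(controller.Options{MaxConcurrentReconciles: concurrency}).
		Complete(r)
}
```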
Signed-off-by: Justin Kulikauskas <[email protected]>
This should help answer some performance questions in the propagator. These metrics are not as good as running a real performance test, but they will provide some data from tests we already run regularly. Signed-off-by: Justin Kulikauskas <[email protected]>
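For reference, a custom metric like this is typically registered on controller-runtime's metrics registry so it shows up on the manager's `/metrics` endpoint. The metric name below is illustrative, not necessarily the one this commit adds.

```go
// Minimal sketch of registering a histogram with the controller-runtime
// metrics registry; the name and help text are assumptions.
package propagator

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var rootPolicyDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "propagator_root_policy_handling_seconds", // hypothetical name
	Help: "Time taken to handle a root policy spec change, in seconds",
})

func init() {
	// metrics.Registry is served by the manager's /metrics endpoint.
	metrics.Registry.MustRegister(rootPolicyDuration)
}

// Usage inside a reconcile:
//   start := time.Now()
//   defer func() { rootPolicyDuration.Observe(time.Since(start).Seconds()) }()
```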
Force-pushed from 9602d12 to 63f618c.
Force-pushed from 63f618c to 0683a4f.
Based on metrics from the e2e tests, this may reduce the number of replicated policy reconciles by 35%.

Events from the dependency watcher are artificially delayed on the replicated policy reconcile queue - the watch there and a watch for the controller-runtime cache on ManagedCluster labels will race each other, and if the cache is old during the reconcile, the needed update will be missed. This is not the best way to resolve the race: it would be better to ensure that both use the same watch or cache somehow.

Refs:
- https://issues.redhat.com/browse/ACM-7332

Signed-off-by: Justin Kulikauskas <[email protected]>

update resource version cache
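The "artificial delay" described in the commit above can be done with the workqueue's `AddAfter`. This is only a sketch; the delay value and the helper name are assumptions.

```go
// Illustrative sketch: instead of enqueueing dependency-watcher events
// immediately, park them for a short delay so the controller-runtime cache
// has a chance to observe the ManagedCluster label update first.
package propagator

import (
	"time"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

const dependencyEventDelay = 2 * time.Second // hypothetical value

func enqueueDelayed(q workqueue.RateLimitingInterface, namespace, name string) {
	req := reconcile.Request{NamespacedName: types.NamespacedName{
		Namespace: namespace,
		Name:      name,
	}}

	// AddAfter keeps the request invisible to workers for the given duration,
	// giving the shared cache time to catch up before the reconcile runs.
	q.AddAfter(req, dependencyEventDelay)
}
```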
This increases the root policy concurrency from 1 to 2, and makes it configurable. It also adds concurrency to the replicated policy reconciler - the default is 10, up from the concurrency it replaces, which defaulted to 5. Some refactoring was done to have all reconciler concurrencies be configured uniformly. Signed-off-by: Justin Kulikauskas <[email protected]>
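A minimal sketch of how such knobs are commonly exposed; the flag names are assumptions, and only the defaults come from the commit message.

```go
// Hypothetical CLI wiring for the concurrency options described above.
package main

import (
	"flag"

	"sigs.k8s.io/controller-runtime/pkg/controller"
)

func main() {
	var rootConcurrency, replicatedConcurrency uint

	flag.UintVar(&rootConcurrency, "root-policy-concurrency", 2,
		"Number of root policy spec reconciles to run concurrently")
	flag.UintVar(&replicatedConcurrency, "replicated-policy-concurrency", 10,
		"Number of replicated policy reconciles to run concurrently")
	flag.Parse()

	// Each value becomes a MaxConcurrentReconciles option for its controller.
	rootOpts := controller.Options{MaxConcurrentReconciles: int(rootConcurrency)}
	replicatedOpts := controller.Options{MaxConcurrentReconciles: int(replicatedConcurrency)}

	_, _ = rootOpts, replicatedOpts // pass into each reconciler's SetupWithManager
}
```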
The bigger change is that the status is now updated before sending the events to the replicated policy reconciler - this should help prevent some requeues. Otherwise, these are largely just organizational changes. Signed-off-by: Justin Kulikauskas <[email protected]>
Force-pushed from 0683a4f to 0997a38.
Added some changes after more discussion with @mprahl. In particular, a race condition where the predicates might not correctly filter out events caused by the reconciler itself (causing an unnecessary duplicate reconcile) has been addressed by adding a set of locks. AFAIK, there are not any more currently requested changes to this PR; it should be ready for another look.
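For context, a "set of locks" keyed by replicated policy could look something like the sketch below. This is illustrative only; it mirrors the `version` value visible in the excerpt that follows, but is not necessarily the exact implementation that landed.

```go
// Illustrative per-policy lock set: the reconciler holds the lock while it
// writes the replicated policy and records the resulting resourceVersion,
// and the event predicate takes the same lock before deciding whether an
// incoming event is just the reconciler's own write.
package propagator

import "sync"

// lockingVersion pairs a mutex with the last resourceVersion the reconciler
// itself wrote for one replicated policy.
type lockingVersion struct {
	sync.Mutex
	resourceVersion string
}

// versionCache hands out one lockingVersion per replicated policy key.
type versionCache struct {
	entries sync.Map // map[string]*lockingVersion
}

func (c *versionCache) Get(key string) *lockingVersion {
	v, _ := c.entries.LoadOrStore(key, &lockingVersion{})
	return v.(*lockingVersion)
}

// Usage in the reconciler (and symmetrically in the predicate):
//   version := cache.Get(req.Namespace + "/" + req.Name)
//   version.Lock()
//   defer version.Unlock()
```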
defer version.Unlock()

// Store this to ensure the cache matches a known possible state for this situation
version.resourceVersion = "deleted"
Is this needed here since what you did shouldn't trigger a reconcile?
Nice job! This looks really good and thank you for all of your hard work on this.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JustinKuli, mprahl

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing …
A change in the propagator split up the `policy-propagator` controller into 2 controllers: `root-policy-spec` and `replicated-policy`. So the old metrics are no longer reported, and these tests should check for the new ones. Refs: - open-cluster-management-io/governance-policy-propagator#130 Signed-off-by: Justin Kulikauskas <[email protected]>
The first 2 commits are refactors that are not all necessary for the other part of the PR. I can slim down those commits if it helps this PR review. I would recommend looking through the commits individually.
The new replicated policy reconciler helps with two things:
It does not "intelligently" handle updates to replication decisions, eg to not recalculate the replicated policy on managed clusters that should not have changed.
This also adds configurable concurrency to the root policy reconciliation, and the new reconciler has configurable concurrency. The new default root policy concurrency is `2`, and the replicated policy concurrency is `10` (up from the default of `5` that we had in our bespoke concurrency logic).