Split replicated policy reconciles #130

JustinKuli · 2023-09-27T18:26:04Z

The first 2 commits are refactors that are not all necessary for the other part of the PR. I can slim down those commits if it helps this PR review. I would recommend looking through the commits individually.

The new replicated policy reconciler helps with two things:

When a replicated policy is updated, other related replicated policies are not recalculated.
The root policy reconciler no longer creates/updates/deletes replicated policies itself, via custom concurrency logic. Instead, it simply sends events to the new reconciler for those policies to recalculate their correct states.

It does not "intelligently" handle updates to replication decisions, eg to not recalculate the replicated policy on managed clusters that should not have changed.

This also adds configurable concurrency to the root policy reconciliation, and the new reconciler has configurable concurrency. The new default root policy concurrency is 2, and the replicated policy concurrency is 10 (up from the default of 5 that we had in our bespoke concurrency logic).

JustinKuli · 2023-09-27T18:32:12Z

For reference, the resource version caching reduced the number of reconciles done in the e2e tests by about 35%, from 577 to 364

JustinKuli · 2023-09-27T18:58:39Z

I am worried that the test failure is because the replicated policy reconcile is now happening too fast? From the failing test:

Expected
  ... "label-vendor-test":"auto-detect\n", "label-vendor-test-two":"Fake\n" ...
to equal
  ... "label-vendor-test":"Fake\n", "label-vendor-test-two":"Fake\n" ...

both of these are from the managed cluster label - the first uses .ManagedClusterLabels, the second uses a lookup. So I wonder if the update to the ManagedCluster triggered the reconcile, but when the ManagedCluster was grabbed from the cache wasn't updated yet... because the ManagedCluster that the template-utils grab for the lookup is correct.

So I'm not really sure what to do about that. It seems to be happening inconsistently. Previously, I think if this happened, it would get handled by the duplicate reconcile.

mprahl · 2023-10-03T14:50:37Z

FYI, the review has moved to JustinKuli#9.

This commit only moves some code between files, it does change any content. Signed-off-by: Justin Kulikauskas <[email protected]>

This adds some documentation, as well as renames the functions to be a bit more descriptive. Most code changes are only stylistic or to help them match other similar functions. Signed-off-by: Justin Kulikauskas <[email protected]>

The new Propagator type has (almost) all the methods that were on the PolicyReconciler - the only exception is `Reconcile`, which is on the new RootPolicyReconciler type. This should help organize things to make a new ReplicatedPolicyReconciler type. Signed-off-by: Justin Kulikauskas <[email protected]>

This new reconciler currently handles when a replicated policy changes unexpectedly, and is triggered by events coming from the root policy reconciler. This removes the custom concurrency for replicated policy creation and deletion, and the concurrency here will be adjustable via the more usual `MaxConcurrentReconciles` controller option. Because the replicated policy reconciler changes the same resources that trigger its reconciliation, it often reconciles the same resource twice in a row. That needs to be addressed for this to be more performant than the previous implementation. When running the tests, it became apparent that the encryptionkeys controller triggers reconciliation on the replicated policies by adding an annotation to the root policy. This is another place for possible improvements, it can likely also trigger replicated policy reconciles directly. Refs: - https://issues.redhat.com/browse/ACM-7332 Signed-off-by: Justin Kulikauskas <[email protected]>

Signed-off-by: Justin Kulikauskas <[email protected]>

This should help answer some performance questions in the propagator. These metrics are not as good as running a real performance tests, but they will provide some data, from tests we already run regularly. Signed-off-by: Justin Kulikauskas <[email protected]>

Based on metrics from the e2e tests, this may reduce the number of replicated policy reconciles by 35%. Events from the dependency watcher are artificially delayed on the replicated policy reconcile queue - the watch there and a watch for the controller-runtime cache on ManagedCluster labels will race each other, and if the cache is old during the reconcile, the needed update will be missed. This is not the best way to resolve the race: it would be better to ensure that both use the same watch or cache somehow. Refs: - https://issues.redhat.com/browse/ACM-7332 Signed-off-by: Justin Kulikauskas <[email protected]> update resource version cache

This increases the root policy concurrency from 1 to 2, and makes it configurable. It also adds concurrency to the replicated policy - the default is 10, up from the concurrency it replaces which defaulted to 5. Some refactoring was done to have all reconciler concurrencies be configured uniformly. Signed-off-by: Justin Kulikauskas <[email protected]>

The bigger change is that the status is now updated before sending the events to the replicated policy reconciler - this should help prevent some requeues. Otherwise, these are largely just organizational changes. Signed-off-by: Justin Kulikauskas <[email protected]>

JustinKuli · 2023-10-06T16:37:07Z

Added some changes after more discussion with @mprahl. In particular, a race condition where the predicates might not correctly filter out events caused by the reconciler itself (causing an unnecessary duplicate reconcile) has been addressed by adding a set of locks.

AFAIK, there are not any more currently requested changes to this PR; it should be ready for another look.

mprahl · 2023-10-06T19:47:16Z

controllers/propagator/replicatedpolicy_controller.go

+			defer version.Unlock()
+
+			// Store this to ensure the cache matches a known possible state for this situation
+			version.resourceVersion = "deleted"


Is this needed here since what you did shouldn't trigger a reconcile?

controllers/propagator/replicatedpolicy_controller.go

mprahl

Nice job! This looks really good and thank you for all of your hard work on this.

openshift-ci · 2023-10-06T20:08:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JustinKuli, mprahl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JustinKuli,mprahl]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

A change in the propagator split up the `policy-propagator` controller into a 2 controllers: `root-policy-spec`, and `replicated-policy`. So the old metrics are no longer reported, and these tests should check for the new ones. Refs: - open-cluster-management-io/governance-policy-propagator#130 Signed-off-by: Justin Kulikauskas <[email protected]>

openshift-ci bot added do-not-merge/work-in-progress dco-signoff: yes approved labels Sep 27, 2023

JustinKuli marked this pull request as ready for review September 27, 2023 18:59

openshift-ci bot removed the do-not-merge/work-in-progress label Sep 27, 2023

openshift-ci bot requested review from dhaiducek and gparvin September 27, 2023 19:00

JustinKuli requested a review from mprahl September 27, 2023 19:00

JustinKuli force-pushed the split-replicated-policy-reconciles branch from bfe9503 to 9602d12 Compare September 28, 2023 16:56

openshift-merge-robot added the needs-rebase label Oct 3, 2023

JustinKuli added 6 commits October 5, 2023 14:08

Refactor propagator reconciler setup to one file

96d58a1

This commit only moves some code between files, it does change any content. Signed-off-by: Justin Kulikauskas <[email protected]>

Refactor predicates and mappers: PolicyReconciler

95ad42f

This adds some documentation, as well as renames the functions to be a bit more descriptive. Most code changes are only stylistic or to help them match other similar functions. Signed-off-by: Justin Kulikauskas <[email protected]>

Remove old concurrency pieces

71d95e0

Signed-off-by: Justin Kulikauskas <[email protected]>

JustinKuli force-pushed the split-replicated-policy-reconciles branch from 9602d12 to 63f618c Compare October 5, 2023 18:29

openshift-merge-robot removed the needs-rebase label Oct 5, 2023

JustinKuli force-pushed the split-replicated-policy-reconciles branch from 63f618c to 0683a4f Compare October 5, 2023 19:37

JustinKuli added 3 commits October 6, 2023 10:40

JustinKuli force-pushed the split-replicated-policy-reconciles branch from 0683a4f to 0997a38 Compare October 6, 2023 14:43

mprahl reviewed Oct 6, 2023

View reviewed changes

controllers/propagator/replicatedpolicy_controller.go Show resolved Hide resolved

mprahl reviewed Oct 6, 2023

View reviewed changes

controllers/propagator/replicatedpolicy_controller.go Show resolved Hide resolved

mprahl reviewed Oct 6, 2023

View reviewed changes

controllers/propagator/replicatedpolicy_controller.go Show resolved Hide resolved

mprahl approved these changes Oct 6, 2023

View reviewed changes

openshift-ci bot assigned mprahl Oct 6, 2023

openshift-ci bot added the lgtm label Oct 6, 2023

openshift-ci bot merged commit b2aa21d into open-cluster-management-io:main Oct 6, 2023

magic-mirror-bot bot mentioned this pull request Oct 6, 2023

😿 Failed to sync the upstream PRs: #130 stolostron/governance-policy-propagator#455

Closed

JustinKuli mentioned this pull request Oct 10, 2023

Sync from open-cluster-management-io/governance-policy-propagator: #130 stolostron/governance-policy-propagator#459

Merged

JustinKuli mentioned this pull request Oct 12, 2023

Update the controller names from the propagator stolostron/governance-policy-framework#664

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Split replicated policy reconciles #130

Split replicated policy reconciles #130

Uh oh!

JustinKuli commented Sep 27, 2023

Uh oh!

JustinKuli commented Sep 27, 2023

Uh oh!

JustinKuli commented Sep 27, 2023 •

edited

Loading

Uh oh!

mprahl commented Oct 3, 2023

Uh oh!

JustinKuli commented Oct 6, 2023

Uh oh!

mprahl Oct 6, 2023 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mprahl left a comment

Uh oh!

openshift-ci bot commented Oct 6, 2023

Uh oh!

Uh oh!

Split replicated policy reconciles #130

Split replicated policy reconciles #130

Uh oh!

Conversation

JustinKuli commented Sep 27, 2023

Uh oh!

JustinKuli commented Sep 27, 2023

Uh oh!

JustinKuli commented Sep 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mprahl commented Oct 3, 2023

Uh oh!

JustinKuli commented Oct 6, 2023

Uh oh!

mprahl Oct 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mprahl left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Oct 6, 2023

Uh oh!

Uh oh!

JustinKuli commented Sep 27, 2023 •

edited

Loading

mprahl Oct 6, 2023 •

edited

Loading