Skip to content

Split replicated policy reconciles #130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

JustinKuli
Copy link
Member

The first 2 commits are refactors that are not all necessary for the other part of the PR. I can slim down those commits if it helps this PR review. I would recommend looking through the commits individually.

The new replicated policy reconciler helps with two things:

  • When a replicated policy is updated, other related replicated policies are not recalculated.
  • The root policy reconciler no longer creates/updates/deletes replicated policies itself, via custom concurrency logic. Instead, it simply sends events to the new reconciler for those policies to recalculate their correct states.

It does not "intelligently" handle updates to replication decisions, eg to not recalculate the replicated policy on managed clusters that should not have changed.

This also adds configurable concurrency to the root policy reconciliation, and the new reconciler has configurable concurrency. The new default root policy concurrency is 2, and the replicated policy concurrency is 10 (up from the default of 5 that we had in our bespoke concurrency logic).

@JustinKuli
Copy link
Member Author

For reference, the resource version caching reduced the number of reconciles done in the e2e tests by about 35%, from 577 to 364

@JustinKuli
Copy link
Member Author

JustinKuli commented Sep 27, 2023

I am worried that the test failure is because the replicated policy reconcile is now happening too fast? From the failing test:

Expected
  ... "label-vendor-test":"auto-detect\n", "label-vendor-test-two":"Fake\n" ...
to equal
  ... "label-vendor-test":"Fake\n", "label-vendor-test-two":"Fake\n" ...

both of these are from the managed cluster label - the first uses .ManagedClusterLabels, the second uses a lookup. So I wonder if the update to the ManagedCluster triggered the reconcile, but when the ManagedCluster was grabbed from the cache wasn't updated yet... because the ManagedCluster that the template-utils grab for the lookup is correct.

So I'm not really sure what to do about that. It seems to be happening inconsistently. Previously, I think if this happened, it would get handled by the duplicate reconcile.

@JustinKuli JustinKuli marked this pull request as ready for review September 27, 2023 18:59
@openshift-ci openshift-ci bot requested review from dhaiducek and gparvin September 27, 2023 19:00
@JustinKuli JustinKuli requested a review from mprahl September 27, 2023 19:00
@JustinKuli JustinKuli force-pushed the split-replicated-policy-reconciles branch from bfe9503 to 9602d12 Compare September 28, 2023 16:56
@mprahl
Copy link
Member

mprahl commented Oct 3, 2023

FYI, the review has moved to JustinKuli#9.

This commit only moves some code between files, it does change any
content.

Signed-off-by: Justin Kulikauskas <[email protected]>
This adds some documentation, as well as renames the functions to be
a bit more descriptive. Most code changes are only stylistic or to help
them match other similar functions.

Signed-off-by: Justin Kulikauskas <[email protected]>
The new Propagator type has (almost) all the methods that were on the
PolicyReconciler - the only exception is `Reconcile`, which is on the
new RootPolicyReconciler type. This should help organize things to make
a new ReplicatedPolicyReconciler type.

Signed-off-by: Justin Kulikauskas <[email protected]>
This new reconciler currently handles when a replicated policy changes
unexpectedly, and is triggered by events coming from the root policy
reconciler. This removes the custom concurrency for replicated policy
creation and deletion, and the concurrency here will be adjustable via
the more usual `MaxConcurrentReconciles` controller option.

Because the replicated policy reconciler changes the same resources that
trigger its reconciliation, it often reconciles the same resource twice
in a row. That needs to be addressed for this to be more performant than
the previous implementation.

When running the tests, it became apparent that the encryptionkeys
controller triggers reconciliation on the replicated policies by adding
an annotation to the root policy. This is another place for possible
improvements, it can likely also trigger replicated policy reconciles
directly.

Refs:
 - https://issues.redhat.com/browse/ACM-7332

Signed-off-by: Justin Kulikauskas <[email protected]>
Signed-off-by: Justin Kulikauskas <[email protected]>
This should help answer some performance questions in the propagator.
These metrics are not as good as running a real performance tests, but
they will provide some data, from tests we already run regularly.

Signed-off-by: Justin Kulikauskas <[email protected]>
@JustinKuli JustinKuli force-pushed the split-replicated-policy-reconciles branch from 9602d12 to 63f618c Compare October 5, 2023 18:29
@JustinKuli JustinKuli force-pushed the split-replicated-policy-reconciles branch from 63f618c to 0683a4f Compare October 5, 2023 19:37
Based on metrics from the e2e tests, this may reduce the number of
replicated policy reconciles by 35%.

Events from the dependency watcher are artificially delayed on the
replicated policy reconcile queue - the watch there and a watch for the
controller-runtime cache on ManagedCluster labels will race each other,
and if the cache is old during the reconcile, the needed update will
be missed. This is not the best way to resolve the race: it would be
better to ensure that both use the same watch or cache somehow.

Refs:
 - https://issues.redhat.com/browse/ACM-7332

Signed-off-by: Justin Kulikauskas <[email protected]>

update resource version cache
This increases the root policy concurrency from 1 to 2, and makes it
configurable. It also adds concurrency to the replicated policy - the
default is 10, up from the concurrency it replaces which defaulted to 5.

Some refactoring was done to have all reconciler concurrencies be
configured uniformly.

Signed-off-by: Justin Kulikauskas <[email protected]>
The bigger change is that the status is now updated before sending the
events to the replicated policy reconciler - this should help prevent
some requeues. Otherwise, these are largely just organizational changes.

Signed-off-by: Justin Kulikauskas <[email protected]>
@JustinKuli JustinKuli force-pushed the split-replicated-policy-reconciles branch from 0683a4f to 0997a38 Compare October 6, 2023 14:43
@JustinKuli
Copy link
Member Author

Added some changes after more discussion with @mprahl. In particular, a race condition where the predicates might not correctly filter out events caused by the reconciler itself (causing an unnecessary duplicate reconcile) has been addressed by adding a set of locks.

AFAIK, there are not any more currently requested changes to this PR; it should be ready for another look.

defer version.Unlock()

// Store this to ensure the cache matches a known possible state for this situation
version.resourceVersion = "deleted"
Copy link
Member

@mprahl mprahl Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed here since what you did shouldn't trigger a reconcile?

Copy link
Member

@mprahl mprahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job! This looks really good and thank you for all of your hard work on this.

@openshift-ci
Copy link

openshift-ci bot commented Oct 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JustinKuli, mprahl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot merged commit b2aa21d into open-cluster-management-io:main Oct 6, 2023
JustinKuli added a commit to JustinKuli/governance-policy-framework that referenced this pull request Oct 12, 2023
A change in the propagator split up the `policy-propagator` controller
into a 2 controllers: `root-policy-spec`, and `replicated-policy`. So
the old metrics are no longer reported, and these tests should check for
the new ones.

Refs:
 - open-cluster-management-io/governance-policy-propagator#130

Signed-off-by: Justin Kulikauskas <[email protected]>
openshift-ci bot pushed a commit to stolostron/governance-policy-framework that referenced this pull request Oct 16, 2023
A change in the propagator split up the `policy-propagator` controller
into a 2 controllers: `root-policy-spec`, and `replicated-policy`. So
the old metrics are no longer reported, and these tests should check for
the new ones.

Refs:
 - open-cluster-management-io/governance-policy-propagator#130

Signed-off-by: Justin Kulikauskas <[email protected]>
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/governance-policy-framework that referenced this pull request Oct 18, 2023
A change in the propagator split up the `policy-propagator` controller
into a 2 controllers: `root-policy-spec`, and `replicated-policy`. So
the old metrics are no longer reported, and these tests should check for
the new ones.

Refs:
 - open-cluster-management-io/governance-policy-propagator#130

Signed-off-by: Justin Kulikauskas <[email protected]>
openshift-ci bot pushed a commit to stolostron/governance-policy-framework that referenced this pull request Oct 18, 2023
A change in the propagator split up the `policy-propagator` controller
into a 2 controllers: `root-policy-spec`, and `replicated-policy`. So
the old metrics are no longer reported, and these tests should check for
the new ones.

Refs:
 - open-cluster-management-io/governance-policy-propagator#130

Signed-off-by: Justin Kulikauskas <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants