Ensure all objects are reconciled at least once #1163

adrianludwin · 2020-09-30T16:23:42Z

While looking into a sporadic but reproducible (maybe 1/5 times) failure
of the rolebinding recreation test, I observed that the objects in the
parent namespace were occasionally not being reconciled after HNC
restarted. This appeared to be because the HC reconciler was not
modifying either the HC or the namespace, and so was skipping the update
of the object reconcilers. This resulted in source objects not being
inserted into the forest, and as a result, propagated objects were being
removed.

By adding a check that the object reconcilers are called when a
namespace is first added to the forest, we can be certain that all
objects are present in the forest.

Tested: ran the rolebinding e2e test 10/10 times successfully.

While looking into a sporadic but reproducible (maybe 1/5 times) failure of the rolebinding recreation test, I observed that the objects in the `parent` namespace were occasionally not being reconciled after HNC restarted. This appeared to be because the HC reconciler was not modifying either the HC or the namespace, and so was skipping the update of the object reconcilers. This resulted in source objects not being inserted into the forest, and as a result, propagated objects were being removed. By adding a check that the object reconcilers are called when a namespace is first added to the forest, we can be certain that all objects are present in the forest. Tested: ran the rolebinding e2e test 10/10 times successfully.

k8s-ci-robot · 2020-09-30T16:23:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianludwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [adrianludwin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

adrianludwin · 2020-09-30T16:24:03Z

/assign @yiqigao217
/assign @GinnyJI
/cc @rjbez17

Made several debuggability and functional improvements: * Added timestamps to all output to correlate with logs. Timestamps are of the form "seconds since Unix epoch" which isn't human friendly but is identical to the timestamps produced by HNC logs. * Fully deleted the HNC deployment in RecoverHNC since HNC doesn't really seem to recover well if various things are changed without restarting the pod. * Make the post-recovery test far more robust by not ignoring failures to delete the test namespaces. * Stop force-deleting pods in the rolebinding test and instead just wait a few moments. I found that we don't actually need to wait for the pod object to be fully deleted on the server for it to stop getting in the way; the container in the pod appears to stop running ~instantly while the pod can occasionally hang around for over 60s. All of these were added as I saw failures in the affected code. Tested: Ran a set of five flaky tests with and without these changes (while also including PR kubernetes-retired#1163). Without, at least one of them failed virtually every time; with this change, they passed 5/5 times on GKE plus 2/2 times on Kind.

yiqigao217

While looking into a sporadic but reproducible (maybe 1/5 times) failure
of the rolebinding recreation test, I observed that the objects in the
parent namespace were occasionally not being reconciled after HNC
restarted. This appeared to be because the HC reconciler was not
modifying either the HC or the namespace, and so was skipping the update
of the object reconcilers.

Why it's flaky but not consistent failure if HC reconciler was not modifying objects? When would it modify objects and cause them to be enqueued to update forest and passed the test?

adrianludwin · 2020-09-30T18:55:44Z

While looking into a sporadic but reproducible (maybe 1/5 times) failure
of the rolebinding recreation test, I observed that the objects in the
parent namespace were occasionally not being reconciled after HNC
restarted. This appeared to be because the HC reconciler was not
modifying either the HC or the namespace, and so was skipping the update
of the object reconcilers.

Why it's flaky but not consistent failure if HC reconciler was not modifying objects? When would it modify objects and cause them to be enqueued to update forest and passed the test?

I believe it's because there's a race condition between reconciling the objects and the namespace. If the namespace is reconciled first, then when we reconcile the source object, it gets inserted into the forest. But if we reconcile the source object first, then the namespace doesn't exist in the forest and we just ignore the object, under the assumption that when we do sync the namespace, we'll enqueue all the objects in that namespace.

yiqigao217 · 2020-09-30T20:27:24Z

I believe it's because there's a race condition between reconciling the objects and the namespace. If the namespace is reconciled first, then when we reconcile the source object, it gets inserted into the forest. But if we reconcile the source object first, then the namespace doesn't exist in the forest and we just ignore the object, under the assumption that when we do sync the namespace, we'll enqueue all the objects in that namespace.

Sorry that I got a little confused. It seems to me that even before your change, if namespace is reconciled first, if there's no change, r.updateObjects(ctx, log, nm) won't get chance to be called?

adrianludwin · 2020-09-30T21:13:00Z

Sorry that I got a little confused. It seems to me that even before your change, if namespace is reconciled first, if there's no change, r.updateObjects(ctx, log, nm) won't get chance to be called?

That's correct - if the namespace is reconciled first, then r.updateObjects won't be called. But the problem is here:

https://github.com/kubernetes-sigs/multi-tenancy/blob/master/incubator/hnc/internal/reconcilers/object.go#L263

If the namespace is reconciled first, then even though r.updateObjects isn't called, when the object is reconciled second it will be put into the forest, and everything is fine. But if the object is reconciled first, then ns.Exists() returns false and the object is not placed into the forest. And then, because r.updateObjects isn't called, it's never reconciled again. This fix addresses that second case.

Does this make sense?

This change fixes todos in the cherrypick for kubernetes-retired#1150 (see issue kubernetes-retired#1149). It simplifies and restructures a lot of the logic to make it easier to follow while looking at less data (e.g. a lot more focus on anchor.Status.State). It also adds a lots more documentation. Tested: all e2e tests pass on GKE 1.18 when combined with fixes to the e2e tests (PRs kubernetes-retired#1160, kubernetes-retired#1162, kubernetes-retired#1163 and kubernetes-retired#1164).

yiqigao217 · 2020-10-01T14:12:03Z

Sorry that I got a little confused. It seems to me that even before your change, if namespace is reconciled first, if there's no change, r.updateObjects(ctx, log, nm) won't get chance to be called?

That's correct - if the namespace is reconciled first, then r.updateObjects won't be called. But the problem is here:

https://github.com/kubernetes-sigs/multi-tenancy/blob/master/incubator/hnc/internal/reconcilers/object.go#L263

If the namespace is reconciled first, then even though r.updateObjects isn't called, when the object is reconciled second it will be put into the forest, and everything is fine. But if the object is reconciled first, then ns.Exists() returns false and the object is not placed into the forest. And then, because r.updateObjects isn't called, it's never reconciled again. This fix addresses that second case.

Does this make sense?

Yes, makes sense now. So if namespace reconciles first, it would be fine since the object reconciler would not early exit with actionNop. And this PR is to make sure we don't early exit in the HC reconciler if the namespace is first-time synced and make sure objects are enqueued.

/lgtm

adrianludwin · 2020-10-01T14:25:37Z

Exactly!

…

On Thu, Oct 1, 2020 at 10:12 AM Yiqi Gao ***@***.***> wrote: Sorry that I got a little confused. It seems to me that even before your change, if namespace is reconciled first, if there's no change, r.updateObjects(ctx, log, nm) won't get chance to be called? That's correct - if the namespace is reconciled first, then r.updateObjects won't be called. But the problem is here: https://github.com/kubernetes-sigs/multi-tenancy/blob/master/incubator/hnc/internal/reconcilers/object.go#L263 If the namespace is reconciled first, then even though r.updateObjects isn't called, when the object is reconciled *second* it will be put into the forest, and everything is fine. But if the *object* is reconciled first, then ns.Exists() returns false and the object is *not* placed into the forest. And then, because r.updateObjects isn't called, it's never reconciled again. This fix addresses that second case. Does this make sense? Yes, makes sense now. So if namespace reconciles first, it would be fine since the object reconciler would not early exit with actionNop. And this PR is to make sure we don't early exit in the HC reconciler if the namespace is first-time synced and make sure objects are enqueued. /lgtm — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#1163 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AE43PZEE226AZI2RQSGUGBTSISE4JANCNFSM4R7JX6HQ> .

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 30, 2020

k8s-ci-robot requested review from Fei-Guo and tashimi September 30, 2020 16:23

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 30, 2020

k8s-ci-robot assigned GinnyJI and yiqigao217 Sep 30, 2020

k8s-ci-robot requested a review from rjbez17 September 30, 2020 16:24

adrianludwin mentioned this pull request Sep 30, 2020

Make e2e test repairs more robust #1164

Merged

yiqigao217 reviewed Sep 30, 2020

View reviewed changes

adrianludwin mentioned this pull request Sep 30, 2020

Don't delete subns if anchor has conflict (master) #1151

Merged

k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2020

k8s-ci-robot merged commit cc57d6c into kubernetes-retired:master Oct 1, 2020

adrianludwin deleted the reconcile-once branch October 5, 2020 15:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure all objects are reconciled at least once #1163

Ensure all objects are reconciled at least once #1163

adrianludwin commented Sep 30, 2020

k8s-ci-robot commented Sep 30, 2020

adrianludwin commented Sep 30, 2020

yiqigao217 left a comment

adrianludwin commented Sep 30, 2020

yiqigao217 commented Sep 30, 2020

adrianludwin commented Sep 30, 2020

yiqigao217 commented Oct 1, 2020

adrianludwin commented Oct 1, 2020 via email

Ensure all objects are reconciled at least once #1163

Ensure all objects are reconciled at least once #1163

Conversation

adrianludwin commented Sep 30, 2020

k8s-ci-robot commented Sep 30, 2020

adrianludwin commented Sep 30, 2020

yiqigao217 left a comment

Choose a reason for hiding this comment

adrianludwin commented Sep 30, 2020

yiqigao217 commented Sep 30, 2020

adrianludwin commented Sep 30, 2020

yiqigao217 commented Oct 1, 2020

adrianludwin commented Oct 1, 2020 via email