add KEP 238, to add controller revision #261

Edwinhr716 · 2024-11-20T18:32:06Z

What type of PR is this?

/kind documentation

What this PR does / why we need it

KEP for the fix for #238, #240, and #281

Which issue(s) this PR fixes

Fixes #

Special notes for your reviewer

Does this PR introduce a user-facing change?

Edwinhr716 · 2024-11-25T23:29:57Z

@kerthcet PTAL as well

keps/238-controller-revision/README.md

ahg-g · 2024-12-10T07:21:17Z

keps/238-controller-revision/README.md

+the value of the template hash generated by the LWS object, with the template hash that the leader pod 
+hash. Because the leader pod spec is determined by the statefulset controller, there is a guarantee that 
+it will always have the right pod spec. So, if the template hashes don't match, it means that the leader pod was created 
+using the old pod spec, meaning the old worker pod spec needs to be used.


what if there was another update that caused this difference? or is that not possible (meaning do we block updates if an update is currently in progress)?

We don't block updates if an update is in progress.

It is a good point: there is a scenario where another update causes the difference. Suppose an update from 1 to 2 is started and the spec is then updated to 3. If the update to 3 happens after the leader pod has been updated to 2 but before the worker pod has, the worker pod will be created with 1, causing the discrepancy again.

To address this, we could do something like this:

    func constructWorkerStatefulSetApplyConfiguration(currentRevision) {
        if updatedTemplateHash != leaderPod.Labels[templateHash] {
            originalLws, err := ApplyRevision(&lws, currentRevision)
            podTemplateSpec = *originalLws.WorkerTemplate.DeepCopy()
            if !TemplateHashMatches(leaderPod.Labels[templateHash], originalLws) {
                originalLws, err := GetRevision(leaderPod.Labels[templateHash])
            }
        }
    }

    // iterates through the list of revisions and returns the lws object with the matching template hash
    func GetRevision(string templateHash) {
        revisions := ListRevisions()
        for _, revision := range revisions {
            lws, err := ApplyRevision(&lws, revision)
            if lws.Labels[templateHash] == templateHash {
                return lws
            }
        }
    }

Wdyt?

sounds good, so we try to find the correct revision by finding the one with a matching hash, makes sense.
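
For readers following along, the lookup agreed on above can be sketched in self-contained Go. This is only an illustration under stated assumptions: `Revision`, `findRevisionByHash`, and `workerTemplateFor` are hypothetical names, and the string fields stand in for the real ControllerRevision data and pod templates.

```go
package main

import "fmt"

// Revision is a simplified, hypothetical stand-in for a ControllerRevision.
type Revision struct {
	Name           string
	TemplateHash   string // hash of the LWS template this revision snapshots
	WorkerTemplate string // placeholder for the stored worker pod template
}

// findRevisionByHash returns the first revision whose template hash matches.
func findRevisionByHash(revisions []Revision, hash string) (Revision, bool) {
	for _, r := range revisions {
		if r.TemplateHash == hash {
			return r, true
		}
	}
	return Revision{}, false
}

// workerTemplateFor picks the worker template that belongs to the leader pod.
// If the leader pod carries the current hash, the current template is used;
// otherwise the template is taken from the revision matching the leader's hash.
func workerTemplateFor(revisions []Revision, leaderHash, currentHash, currentTemplate string) (string, error) {
	if leaderHash == currentHash {
		return currentTemplate, nil
	}
	rev, ok := findRevisionByHash(revisions, leaderHash)
	if !ok {
		return "", fmt.Errorf("no revision found for template hash %q", leaderHash)
	}
	return rev.WorkerTemplate, nil
}

func main() {
	revisions := []Revision{
		{Name: "rev-1", TemplateHash: "h1", WorkerTemplate: "worker-v1"},
		{Name: "rev-2", TemplateHash: "h2", WorkerTemplate: "worker-v2"},
	}
	// The LWS has moved on to h3, but this leader pod was created from h2.
	tmpl, err := workerTemplateFor(revisions, "h2", "h3", "worker-v3")
	fmt.Println(tmpl, err)
}
```

Matching on the leader pod's own hash, rather than on "the previous revision", is what keeps the worker template correct when several updates overlap.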

ahg-g · 2024-12-10T07:29:57Z

keps/238-controller-revision/README.md

+}
+```
+
+Once the update has been determined to be done, `currentRevision` will be set to be the value of `updateRevision`


can you clarify when updateRevision gets created/set?

updateRevision gets created when no equal revision already exists. So the pseudocode is:

    func GetLeaderWorkerSetRevisions(ctx, r.Client, lws) {
        revisions := ListControllerRevisions()
        updateRevision := NewRevision(lws)
        equalRevisions := FindEqualRevisions(revisions, updateRevision)
        if len(equalRevisions) == 0 {
            CreateControllerRevision(lws, updateRevision)
        } else if equalRevision(revisions[len(revisions) - 1], equalRevisions[equalCount-1]) {
            // in this case, updateRevision is the same as currentRevision, so no update is occurring
            updateRevision = equalRevisions[equalCount-1]
        }
    }

CurrentRevision doesn't get updated/changed unless lws.CurrentRevision is nil. So during an update, updateRevision and currentRevision will be different until the update is complete.
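
A minimal runnable sketch of the get-or-create behaviour described in this reply, assuming a history sorted by ascending revision number. The `Revision` type and `getOrCreateUpdateRevision` are hypothetical simplifications, not the KEP's actual helpers, and `Data` is a placeholder for the serialized template.

```go
package main

import "fmt"

// Revision is a hypothetical, simplified revision record.
type Revision struct {
	Number int64
	Data   string // placeholder for the serialized template snapshot
}

// getOrCreateUpdateRevision returns the revision whose data equals the given
// template data. If none exists, it appends a new revision with the next
// number. The boolean reports whether a new revision was created.
func getOrCreateUpdateRevision(history []Revision, data string) ([]Revision, Revision, bool) {
	// Search from newest to oldest so the latest equal revision wins.
	for i := len(history) - 1; i >= 0; i-- {
		if history[i].Data == data {
			return history, history[i], false
		}
	}
	next := int64(1)
	if n := len(history); n > 0 {
		next = history[n-1].Number + 1
	}
	rev := Revision{Number: next, Data: data}
	return append(history, rev), rev, true
}

func main() {
	var history []Revision
	history, current, _ := getOrCreateUpdateRevision(history, "template-v1")
	// A template change produces a new update revision; until the rollout
	// completes, current and update refer to different revisions.
	history, update, created := getOrCreateUpdateRevision(history, "template-v2")
	fmt.Println(current.Number, update.Number, created, len(history))
}
```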

Edwinhr716 · 2024-12-10T16:55:55Z

/retest

ahg-g · 2024-12-12T03:59:05Z

keps/238-controller-revision/README.md

+// LeaderWorkerSetStatus defines the observed state of LeaderWorkerSet
+type LeaderWorkerSetStatus struct {
+	// currentRevision, if not empty, indicates the version of lws
+	// used to generate the worker pods in sequence [0,currentReplicas)


can you define what is currentReplicas, replicas and updatedReplicas?

ahg-g

I think we can simplify this as follows:

- Create a ControllerRevision each time we set the leader sts with a different template hash.
- In the pod reconciler, simply look up the revision with a matching template hash.
- Truncate history each time an update is done, leaving only the ControllerRevision whose hash matches the current template.

With this, there is no need to maintain anything in status.

Edwinhr716 · 2024-12-13T18:12:18Z

Makes sense, I'll update the KEP with those changes, and add changes to fix #281

ahg-g

Discussed offline:

- Nothing will be tracked in lws status.
- We will use the template hash as a label on each new controller revision that gets created. This key is stored as a label on the leader sts.

On each lws reconcile:

- Use the hash key on the leader sts to look up the latest controller revision.
- If one exists, compare the existing controller revision with the current lws spec:
  - If different, trigger an update.
  - If the same, don't trigger an update.
- If none exists, create one and don't trigger an update.
- Each time an update is done, trim the controller revisions and keep only the one with the same hash key as the current leader sts.

On each leader pod reconcile:

- Use the hash key to look up the controller revision.
- Use the worker template in the controller revision to create the worker sts.
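
The lws reconcile steps above can be sketched as a small decision function plus a trim step. This is a hedged illustration only: `Revision`, `Action`, `reconcileLWS`, and `trimRevisions` are hypothetical names, specs are compared as plain strings, and creating the revision for the new template when an update is triggered is left out.

```go
package main

import "fmt"

// Revision is a hypothetical stand-in for a ControllerRevision labeled with
// the template hash it was created from.
type Revision struct {
	TemplateHash string
	Spec         string // placeholder for the stored lws spec
}

// Action is the outcome of one lws reconcile pass.
type Action int

const (
	NoUpdate Action = iota
	TriggerUpdate
)

// reconcileLWS looks up the revision matching the hash on the leader sts.
// If one exists, it is compared with the current spec to decide on an update.
// If none exists, one is created and no update is triggered.
func reconcileLWS(revisions []Revision, stsHash, currentSpec string) ([]Revision, Action) {
	for _, r := range revisions {
		if r.TemplateHash == stsHash {
			if r.Spec != currentSpec {
				return revisions, TriggerUpdate
			}
			return revisions, NoUpdate
		}
	}
	return append(revisions, Revision{TemplateHash: stsHash, Spec: currentSpec}), NoUpdate
}

// trimRevisions keeps only the revision whose hash matches the leader sts,
// as done once an update has completed.
func trimRevisions(revisions []Revision, stsHash string) []Revision {
	kept := make([]Revision, 0, 1)
	for _, r := range revisions {
		if r.TemplateHash == stsHash {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	revs, action := reconcileLWS(nil, "h1", "spec-v1")
	fmt.Println(len(revs), action == NoUpdate)

	// The user edits the lws spec while the leader sts still carries h1.
	_, action = reconcileLWS(revs, "h1", "spec-v2")
	fmt.Println(action == TriggerUpdate)
}
```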

ahg-g · 2024-12-17T01:01:45Z

keps/238-controller-revision/README.md

+func templateUpdated(sts, lws) bool {
+	controllerRevision := GetLeaderWorkerSetRevisionFromTemplateHash(sts.Labels[templateHash])
+	baselineLws:= controllerutils.ApplyRevision(lws, controllerRevision)
+	return !utils.EqualLeaderWorkerTemplates(baselineLws, lws)


we also need to check some parameters of the LWS spec itself, specifically the subdomain policy

Yes, equalLeaderWorkerTemplates would check both lws.spec.leaderWorkerTemplate and subdomainPolicy, treating networkConfig == nil as subdomainShared, just as is done for templateHash.
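
A rough sketch of that comparison, assuming simplified types: `LWSSpec`, `NetworkConfig`, and `effectiveSubdomainPolicy` here are illustrative stand-ins rather than the real API types, and the template is reduced to a string. The point is only the nil handling.

```go
package main

import "fmt"

// SubdomainPolicy and its values are simplified stand-ins for the lws API.
type SubdomainPolicy string

const (
	SubdomainShared           SubdomainPolicy = "Shared"
	SubdomainUniquePerReplica SubdomainPolicy = "UniquePerReplica"
)

// NetworkConfig is a hypothetical, reduced version of the lws network config.
type NetworkConfig struct {
	SubdomainPolicy *SubdomainPolicy
}

// LWSSpec is a hypothetical, reduced lws spec.
type LWSSpec struct {
	LeaderWorkerTemplate string // placeholder for the real template struct
	NetworkConfig        *NetworkConfig
}

// effectiveSubdomainPolicy treats a missing network config (or a missing
// policy) as the shared subdomain policy.
func effectiveSubdomainPolicy(nc *NetworkConfig) SubdomainPolicy {
	if nc == nil || nc.SubdomainPolicy == nil {
		return SubdomainShared
	}
	return *nc.SubdomainPolicy
}

// equalLeaderWorkerTemplates compares both the template and the effective
// subdomain policy of two specs.
func equalLeaderWorkerTemplates(a, b LWSSpec) bool {
	return a.LeaderWorkerTemplate == b.LeaderWorkerTemplate &&
		effectiveSubdomainPolicy(a.NetworkConfig) == effectiveSubdomainPolicy(b.NetworkConfig)
}

func main() {
	shared := SubdomainShared
	implicit := LWSSpec{LeaderWorkerTemplate: "tmpl"}
	explicit := LWSSpec{LeaderWorkerTemplate: "tmpl", NetworkConfig: &NetworkConfig{SubdomainPolicy: &shared}}
	// A nil network config and an explicit shared policy compare as equal.
	fmt.Println(equalLeaderWorkerTemplates(implicit, explicit))
}
```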

ahg-g · 2024-12-18T05:25:26Z

/lgtm
/approve

k8s-ci-robot · 2024-12-18T05:25:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, Edwinhr716

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ahg-g]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Edwinhr716 added 2 commits November 20, 2024 18:10

added initial files

2352115

cleanup

660e960

k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Nov 20, 2024

k8s-ci-robot requested review from ahg-g and liurupeng November 20, 2024 18:32

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 20, 2024

ahg-g reviewed Dec 10, 2024


typo

90277d8

ahg-g reviewed Dec 12, 2024


ahg-g reviewed Dec 16, 2024


added design changes discussed

24543d3

ahg-g reviewed Dec 17, 2024


ahg-g added the tide/merge-method-merge Denotes a PR that should use a standard merge by tide when it merges. label Dec 18, 2024

k8s-ci-robot assigned ahg-g Dec 18, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 18, 2024

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 18, 2024

k8s-ci-robot merged commit a09fe75 into kubernetes-sigs:main Dec 18, 2024
9 checks passed
