Add KEP #257: Leader Only SubGroup #402
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Edwinhr716. The full list of commands accepted by this bot can be found here; the pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
I may review this tomorrow because of limited bandwidth, sorry for that.

reviewing now.
several comments.
```go
const (
	SubGroupPolicyLeaderWorker SubGroupPolicyType = "LeaderWorker"

	SubGroupPolicyLeaderOnly SubGroupPolicyType = "LeaderOnly"
)
```
I think `LeaderOnly` is a bit hard to understand, what about `LeaderAlone` vs `All`?
I'm fine with `LeaderAlone`, but `All` sounds a bit ambiguous to me. `LeaderAlone` vs `LeaderWorker` maybe? Or something that specifies that the leader and a worker pod will be part of the same subgroup.
> …to the number of workers (so size - 1).
>
> ### Non-Goals
> This KEP assumes that the leader will not request TPU resources, and thus, subGroupSize will always be assumed to be an odd number. Moreover, no TPU environment variables will be…
Sorry, does this only work for TPU? I mean the exclusive annotation, I forgot.
Both TPU and GPU are supported for exclusive placement. I'll rephrase to be accelerator agnostic
> * `leaderworkerset.sigs.k8s.io/subgroup-policy-type`
>
> In order to keep backwards compatibility, it will only be added if the type is `LeaderOnly`.
I think even if we add the annotation for both types, it is still backward compatible. No behavior change.
The scenario that I see is that there is an existing deployment using `SubGroupPolicy` which then upgrades the LWS controller. After the upgrade, the controller will inject the annotation into the PodSpec, which will then trigger an update at the leader StatefulSet level since the Pod template has changed.
Ah, yes, this is disgusting, usually we only reconcile when spec changes.
> Because the leader now occupies a subgroup, and `SubGroupSize` will not always be one, the way we calculate the SubGroupIndex needs to be modified. When generating…
One argument here is: should we mark the leader as a subgroup? What are the cons if we don't do this?
That's a good point, I hadn't considered not making the leader part of a subgroup at all. Doing it this way simplifies the rest of the changes (though the TPU environment injection will still need some adjustment). Will test it out and add whichever version we don't implement to the Alternatives.

Changed the implementation so the leader is not actually part of any subgroup, since it is simpler, and added the original implementation to Alternatives. Also changed LeaderOnly to LeaderAlone. Please take another look @kerthcet
> In order to keep backwards compatibility, it will only be added if the type is `LeaderAlone`.
>
> ### Subgroup Creation
> Implementation-wise, the only change needed is to not add the SubGroup labels on the leader if the SubGroupType is LeaderAlone. Effectively, this means…
The indexing within the group will be different now. Currently we don't define an index within the group, we do it only for tpuWorkerId, and this should be calculated differently now to exclude the leader: https://github.com/kubernetes-sigs/lws/blob/main/pkg/utils/accelerators/tpu.go#L113, right?
We already do this when the leader doesn't request TPU resources https://github.com/kubernetes-sigs/lws/blob/main/pkg/utils/accelerators/tpu.go#L115-L117
I think we can leave more implementation details to the PR.
Co-authored-by: Abdullah Gharaibeh <[email protected]>
I think the calculation of the worker subGroupIndex still needs to change. For example, if we set subGroupSize=2:
- before, with group size = 6, it looks like (0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2)
- after, with group size = 7, it looks like (0, x), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 2)

(x, y) represents (workerIndex, subGroupIndex); for workerIndex=2, before the subGroupIndex is 1, after it is 0. Not familiar with the TPU implementation, maybe it's the same? But this is just code implementation, we can leave the detailed review to the PR, just mention this fact briefly in the proposal.
Is this not what we want? With this calculation, we ensure that there are two pods per subgroup, which should be the case if…
Yes, the result is as we expected, I just mean the method may need to change: `lws/pkg/webhooks/pod_webhook.go` lines 245 to 251 in ab40b62.
For workerIndex=2, before it's (2, 1), now it's (2, 0); the subGroupIndex changes. Am I missing anything? I think we'll call that method, right?
Yeah, we call that method. I guess I'm a little confused about what scenario you are referring to. The current behavior when using this feature has workerIndex=2 at subgroup 0. If using the default configuration, but still having an odd number for size, it will still be workerIndex=2 at subgroup 0. The only case where workerIndex=2 gives subgroup 1 is when using the default behavior and size is even.
Shouldn't this be workerIndex/subGroupSize = 2/2 = 1, rather than 0? Let's move the discussion to the PR then.
What type of PR is this?
/kind documentation
What this PR does / why we need it
Which issue(s) this PR fixes
Part of #257
Special notes for your reviewer
Does this PR introduce a user-facing change?