Skip to content

Add KEP #257: Leader Only SubGroup #402

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 13, 2025

Conversation

Edwinhr716
Copy link
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it

Which issue(s) this PR fixes

Part of #257

Special notes for your reviewer

Does this PR introduce a user-facing change?


@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Feb 27, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Edwinhr716

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 27, 2025
@kerthcet
Copy link
Contributor

kerthcet commented Mar 3, 2025

I may review this tomorrow because of limited bandwidth, sorry for that.
/assign

@kerthcet
Copy link
Contributor

kerthcet commented Mar 6, 2025

reviewing now.

Copy link
Contributor

@kerthcet kerthcet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

several comments.

const (
SubGroupPolicyLeaderWorker SubGroupPolicyType = "LeaderWorker"

SubGroupPolicyLeaderOnly SubGroupPolicyType = "LeaderOnly"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think LeaderOnly is a bit hard to understand, what about LeaderAlone vs All.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with LeaderAlone, but All sounds a bit ambiguous to me. LeaderAlone vs LeaderWorker maybe? Or something that specifies that the leader and a worker pod will be part of the same subgroup

to the number of workers (so size - 1).

### Non-Goals
This KEP assumes that the leader will not request TPU resources, and thus, subGroupSize will always be assumed to be an odd number. Moreover, no TPU environment variables will be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, is this only work for TPU? I mean the exclusive annotation, I forgot it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both TPU and GPU are supported for exclusive placement. I'll rephrase to be accelerator agnostic


* `leaderworkerset.sigs.k8s.io/subgroup-policy-type`

In order to keep backwards compatability, it will only be added if the type is `LeaderOnly`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think even we add the annotation for both types is still backward compatibility. No behavior change.

Copy link
Contributor Author

@Edwinhr716 Edwinhr716 Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scenario that I see is that there is an existing deployment using SubGroupPolicy which then upgrades the LWS controller. After the upgrade, it will inject the annotation on the PodSpec, which will then trigger an update at the leader Sts level since the Pod has changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, this is disgusting, usually we only reconcile when spec changes.

}
```

Because the leader now occupies a subgroup, and `SubGroupSize` will not always be one, the way we calculate the SubGroupIndex needs to be modified. When generating
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One argue here is should we mark the leader a sub group? What's the Cons if we don't do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, hadn't considered not making the leader part of a subgroup at all. Doing it this way simplifies the rest of the changes (though the TPU environment injection will still need some adjustment). Will test it out and add to the Alternatives which ever version we don't implement

@Edwinhr716
Copy link
Contributor Author

Changed implementation to the leader not actually being part of any subgroup since it is simpler, added original implementation to Alternatives. Also changed LeaderOnly to LeaderAlone. Please take another look @kerthcet

In order to keep backwards compatability, it will only be added if the type is `LeaderAlone`.

### Subgroup Creation
Implementation wise, the only change needed is to not add the SubGroup labels on the leader if the SubGroupType is LeaderAlone. Effectively, this means
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indexing within the group will be different now. Currently we don't define an index within the group, we do it only for tpuWorkerId, and this should be calculated differently now to exclude the leader: https://github.com/kubernetes-sigs/lws/blob/main/pkg/utils/accelerators/tpu.go#L113, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already do this when the leader doesn't request TPU resources https://github.com/kubernetes-sigs/lws/blob/main/pkg/utils/accelerators/tpu.go#L115-L117

Copy link
Contributor

@kerthcet kerthcet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can leave more implementation details to the PR.

Copy link
Contributor

@kerthcet kerthcet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the calculation of worker subGroup index still needs to change, for example, we set the subgroupSize=2,

  • before with group size = 6, it looks like (0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5 ,2)
  • after with group size = 7, it looks like (0, x), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 2)

(x, y) represents the (workerIndex, subGroupIndex), for the workerIndex=2, before the subgroupIndex is 1, after is 0. Not familiar with the TPU implementation, maybe the same? But this is just code implementation, we can leave the detailed review to the PR, just mention this fact in the proposal briefly.

@Edwinhr716
Copy link
Contributor Author

I think the calculation of worker subGroup index still needs to change, for example, we set the subgroupSize=2,

  • before with group size = 6, it looks like (0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5 ,2)
  • after with group size = 7, it looks like (0, x), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 2)

(x, y) represents the (workerIndex, subGroupIndex), for the workerIndex=2, before the subgroupIndex is 1, after is 0. Not familiar with the TPU implementation, maybe the same? But this is just code implementation, we can leave the detailed review to the PR, just mention this fact in the proposal briefly.

Is this not what we want? With this calculation, we ensure that there are two pods per subgroup, which should be the case if subGroupSize=2. If we want the subGroupIndex to match, then the first subgroup would only have 1 pod, same with the last subgroup.

@kerthcet
Copy link
Contributor

Yes, the result is as we expected, I just mean the method may need to change:

func getSubGroupIndex(podCount int, subGroupSize int, workerIndex int) string {
if (podCount-1)%subGroupSize == 0 {
// Leader is considered as extra pod, it is part of the first group
return fmt.Sprint((workerIndex - 1) / subGroupSize)
}
return fmt.Sprint(workerIndex / subGroupSize)
}

For workIndex=2, before it's (2, 1), now it's (2, 0), the subGroupIndex changes. Do I miss anything? I think we'll call the method, right?

@Edwinhr716
Copy link
Contributor Author

Edwinhr716 commented Mar 13, 2025

Yeah we call that method. I guess I'm a little bit confused on what scenario you are referring to. The current behavior when using this feature has workerIndex=2 at subgroup 0. If using the default configuration, but still having an odd number for size, it will still be workerIndex=2 at subgroup 0. The only case where workerIndex=2 gives subgroup 1, is if using default behavior and size is even

@kerthcet
Copy link
Contributor

If using the default configuration, but still having an odd number for size, it will still be workerIndex=2 at subgroup 0

Doesn't this should be workerIndex/subGroupSize = 2/2 = 1, rather than 0.

Let's move the discussion to PR then.
/lgtm
/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Mar 13, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 639147f into kubernetes-sigs:main Mar 13, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants