# Add KEP #257: Leader Only SubGroup (#402)

Merged 7 commits on Mar 13, 2025.
**keps/257-Subgroup-leader-only/README.md** (353 additions)
# KEP-257: SubGroup Leader Only

<!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->

<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.

Ensure the TOC is wrapped with
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories (Optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [SubGroupIndex and TPU Environment Injection](#subgroupindex-and-tpu-environment-injection)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
- [Integration tests](#integration-tests)
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

<!--
This section is incredibly important for producing high-quality, user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins, in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself. KEP editors and SIG Docs
should help to ensure that the tone and content of the `Summary` section is
useful for a wide audience.

A good summary is probably at least a paragraph in length.

Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.

[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->
This KEP extends the SubGroup feature by adding an API field that defines the type of subgroup to create. The existing behavior
becomes the "LeaderWorker" type, and a new "LeaderOnly" type creates a subgroup that includes only the leader.


## Motivation

<!--
This section is for explicitly listing the motivation, goals, and non-goals of
this KEP. Describe why the change is important and the benefits to users. The
motivation section can optionally provide links to [experience reports] to
demonstrate the interest in a KEP within the wider Kubernetes community.

[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

There are cases where the resource requirements of the leader differ from those of the workers, while exclusive placement must still be guaranteed
for the workers. With a LeaderOnly subgroup, the placement of the leader is independent of the workers, and by adding the `subgroup-exclusive-topology`
annotation, the workers can be guaranteed to be scheduled on the same topology.

### Goals

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->

Add an option to create a LeaderOnly subgroup, while still being able to create subgroups among the other workers. For example, it should be possible to
produce the following split: (leader), (worker-1, worker-2), (worker-3, worker-4). If the desired effect is to have one subgroup for the leader and one
subgroup for all the workers, `subGroupSize` should be set to the number of workers (that is, size - 1).

### Non-Goals
This KEP assumes that the leader will not request TPU resources, and thus `subGroupSize` will always be assumed to be an odd number. Moreover, no TPU environment variables will be
injected on the leader.

> **Reviewer:** Sorry, does this only work for TPU? I mean the exclusive annotation, I forgot.
>
> **Author:** Both TPU and GPU are supported for exclusive placement. I'll rephrase to be accelerator agnostic.

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?.
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

<!--
Detail the things that people will be able to do if this KEP is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.
-->

#### Story 1
As a user, I want to create a subgroup that contains only the leader, so that the leader can be scheduled on a different topology while
exclusive topology is still kept for the workers.

### Notes/Constraints/Caveats (Optional)
As mentioned in Non-Goals, this KEP assumes that the leader does not request TPU resources.

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->

### Risks and Mitigations

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.

How will security be reviewed, and by whom?

How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->

A new API field will be added under the SubGroupPolicy type. This API field will determine what type
of subgroup is desired. The default is LeaderWorker, which is the existing implementation of subgroup. `SubGroupSize`
will continue to be a required field, while `Type` will be optional.

```golang
type SubGroupPolicy struct {

	// +kubebuilder:validation:Enum={LeaderWorker,LeaderOnly}
	// +kubebuilder:default=LeaderWorker
	Type SubGroupPolicyType `json:"subGroupPolicyType,omitempty"`

	SubGroupSize *int32 `json:"subGroupSize,omitempty"`
}

type SubGroupPolicyType string

const (
	SubGroupPolicyLeaderWorker SubGroupPolicyType = "LeaderWorker"

	SubGroupPolicyLeaderOnly SubGroupPolicyType = "LeaderOnly"
)
```

> **Reviewer:** I think LeaderOnly is a bit hard to understand; what about LeaderAlone vs All?
>
> **Author:** I'm fine with LeaderAlone, but All sounds a bit ambiguous to me. LeaderAlone vs LeaderWorker maybe? Or something that specifies that the leader and a worker pod will be part of the same subgroup.

A new annotation will be created to indicate whether the subgroup type is LeaderOnly:

* `leaderworkerset.sigs.k8s.io/subgroup-policy-type`

To keep backwards compatibility, the annotation will only be added if the type is `LeaderOnly`.
> **Reviewer:** I think even if we add the annotation for both types, it is still backwards compatible. No behavior change.
>
> **Author (Edwinhr716, Mar 6, 2025):** The scenario that I see is an existing deployment using SubGroupPolicy which then upgrades the LWS controller. After the upgrade, the controller will inject the annotation on the PodSpec, which will then trigger an update at the leader StatefulSet level since the Pod has changed.
>
> **Reviewer:** Ah, yes, that is unpleasant; usually we only reconcile when the spec changes.

### SubGroupIndex and TPU Environment Injection
`SubGroupIndex` is a label that we use to generate the hash used for exclusive affinity:

```golang
subGroupUniqueKey := genGroupUniqueKey(leaderName, subGroupIndexKey)
pod.Labels[leaderworkerset.SubGroupUniqueHashLabelKey] = subGroupUniqueKey
if subEpKey, foundSubEpKey := pod.Annotations[leaderworkerset.SubGroupExclusiveKeyAnnotationKey]; foundSubEpKey {
	SetExclusiveAffinities(pod, subGroupUniqueKey, subEpKey, leaderworkerset.SubGroupUniqueHashLabelKey)
}
```
and that we also use to determine which pod names to include when injecting the TPU hostnames:

```golang
start := subGroupSize*subGroupIndex + 1
end := subGroupSize * (subGroupIndex + 1)
for i := start; i <= end; i++ {
	hostnames = append(hostnames, fmt.Sprintf("%s-%d.%s", leaderName, i, pod.Spec.Subdomain))
}
```

Because the leader now occupies a subgroup, and `SubGroupSize` will not always be one, the way we calculate the SubGroupIndex needs to be modified. When generating
the hash, we will shift it one to the right.

> **Reviewer:** One point to argue here: should we mark the leader as a subgroup at all? What are the cons if we don't?
>
> **Author:** That's a good point; I hadn't considered not making the leader part of a subgroup at all. Doing it this way simplifies the rest of the changes (though the TPU environment injection will still need some adjustment). I will test it out and add whichever version we don't implement to the Alternatives.

```golang
if leaderOnly {
	return fmt.Sprint(((workerIndex - 1) / subGroupSize) + 1)
}
```

When injecting TPU environment variables, we will keep the same logic that exists for when the leader doesn't request TPU resources. However, all the workers will have their
subGroupIndex shifted back down, so that the start and end calculation remains accurate:

```golang
// Take worker 1 as an example, with subGroupSize set to 2. It now has
// subGroupIndex set to 1, which would make start 3 and end 4. Shift it
// down by 1 so the calculation still works out.
if leaderOnlyType && subGroupIndex != 0 {
	subGroupIndex -= 1
}
start := subGroupSize*subGroupIndex + 1
end := subGroupSize * (subGroupIndex + 1)
```

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

<!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->

##### Unit tests
- Add unit tests to validate that the right subGroupIndex is injected
- Add unit tests to validate that the correct env TPU variables are injected


<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>
The data can be easily read from:
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

##### Integration tests

<!--
Integration tests are contained in k8s.io/kubernetes/test/integration.
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->

<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.

For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->

- <test>: <link to test coverage>

##### e2e tests

<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.

For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html

We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

- <test>: <link to test coverage>

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
**keps/257-Subgroup-leader-only/kep.yaml** (17 additions)
title: SubGroup Leader Only
kep-number: 257
authors:
- edwinhr716
status: implementable
creation-date: 2025-02-27
reviewers:
- kerthcet
- ahg-g
approvers:
- kerthcet
- ahg-g

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v0.6.0"