Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Add KEP #257: Leader Only SubGroup #402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add KEP #257: Leader Only SubGroup #402
Changes from 1 commit
fa0670e
1bf86ea
cc18355
d47f518
0cef2ed
b9c7aa3
aebce78
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, is this only work for TPU? I mean the exclusive annotation, I forgot it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both TPU and GPU are supported for exclusive placement. I'll rephrase to be accelerator agnostic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
LeaderOnly
is a bit hard to understand, what aboutLeaderAlone
vsAll
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with
LeaderAlone
, butAll
sounds a bit ambiguous to me.LeaderAlone
vsLeaderWorker
maybe? Or something that specifies that the leader and a worker pod will be part of the same subgroupThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think even we add the annotation for both types is still backward compatibility. No behavior change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scenario that I see is that there is an existing deployment using
SubGroupPolicy
which then upgrades the LWS controller. After the upgrade, it will inject the annotation on the PodSpec, which will then trigger an update at the leader Sts level since the Pod has changed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, this is disgusting, usually we only reconcile when spec changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One argue here is should we mark the leader a sub group? What's the Cons if we don't do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, hadn't considered not making the leader part of a subgroup at all. Doing it this way simplifies the rest of the changes (though the TPU environment injection will still need some adjustment). Will test it out and add to the
Alternatives
which ever version we don't implement