-
Notifications
You must be signed in to change notification settings - Fork 797
KEP-2170: Add the manifests overlay for Kubeflow Training V2 #2382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Add the manifests overlay for Kubeflow Training V2 #2382
Conversation
Will review it later. /cc @kubeflow/wg-training-leads @kubeflow/release-team @saileshd1402 |
@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/release-team, saileshd1402. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating this @Doris-xm! I left some initial comments for you.
Btw, I recommend that you could learn more about the concept of Training V2. This will help you update the manifests overlay in training-operator and kubeflow/manifests:)
FYI: KubeCon 2024 NA Talk by Andrey and Yuki
/rerun-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically LGTM. Thank you for your great contributions!
/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this great contribution @willb!
/assign @kubeflow/wg-manifests-leads @kubeflow/release-team
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your great contribution and your patience!
I think, we should also add namespace configuration here:
to install these components in kubeflow-system
namespace.
Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]>
…ace: kubeflow-system Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]>
9518f7b
to
1f1b0c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for updating this:)
/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @saileshd1402
Maybe we could try to fix it in recent days. Hey @Doris-xm, do you get a chance to finish this PR? |
I’ll give it a try and do my best to get it done within a week. @Electronic-Waste |
just remember it does not have to be perfect. We can merge and iterate on it. |
Signed-off-by: Xinmin Du <[email protected]>
We've closed the standalone manifests PR due to the reason listed in #2527 (comment). Different from standalone installtion, Kubeflow Platform will try to install al components until they succeed. So it's okay to install the crds and runtimes together in a single kustomization file. @Doris-xm Please can you update this PR so that we can merge it and unblock the release? REF: #2527 (comment) |
- name: ghcr.io/kubeflow/trainer/deepspeed-runtime | ||
newTag: latest | ||
- name: ghcr.io/kubeflow/trainer/mlx-runtime | ||
newTag: latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can you change the order of these images, so we will be in-sync with release-2.0 branch ?
https://github.com/kubeflow/trainer/blob/32a474c8329e2608cab2152ba4f57ee54a749782/manifests/overlays/runtimes/kustomization.yaml
images:
- name: ghcr.io/kubeflow/trainer/torchtune-trainer
newTag: v2.0.0-rc.0
- name: ghcr.io/kubeflow/trainer/dataset-initializer
newTag: v2.0.0-rc.0
- name: ghcr.io/kubeflow/trainer/model-initializer
newTag: v2.0.0-rc.0
- name: ghcr.io/kubeflow/trainer/mlx-runtime
newTag: v2.0.0-rc.0
- name: ghcr.io/kubeflow/trainer/deepspeed-runtime
newTag: v2.0.0-rc.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need change the image tag from latest
to v2.0.0-rc.0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in the master branch, we should keep them v2.0.0-rc.0
on the release-2.0
branch.
kind: Kustomization | ||
namespace: kubeflow | ||
sortOptions: | ||
order: fifo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fifo
works in a Kubernetes cluster.
I conducted the experiment and found that after removing the fifo flag, the following error occurred:
Error from server (InternalError): Internal error occurred: failed calling webhook "validator.clustertrainingruntime.trainer.kubeflow.org": failed to call webhook: Post "https://kubeflow-trainer-controller-manager.kubeflow.svc:443/validate-trainer-kubeflow-org-v1alpha1-clustertrainingruntime?timeout=10s": no endpoints available for service "kubeflow-trainer-controller-manager"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine. As we discussed, in Kubeflow manifests we re-deploy resources until it will succeed
- ../../third-party/jobset # Comment this line if JobSet is installed on the Kubernetes cluster. | ||
- ../../base/crds | ||
- ../../base/runtimes | ||
- ../../base/manager | ||
- ../../base/rbac | ||
- ../../base/webhook |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just re-use the manager
and runtimes
overlays here and create patch to remove the namespace.yaml
?
Signed-off-by: Xinmin Du <[email protected]>
f24c3ce
to
c3a0e14
Compare
Signed-off-by: Xinmin Du <[email protected]>
@@ -0,0 +1,7 @@ | |||
apiVersion: kustomize.config.k8s.io/v1beta1 | |||
kind: Kustomization | |||
namespace: kubeflow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might need to create patch to remove this namespace:https://github.com/kubeflow/trainer/blob/7335204c3db0611672d5c82848a9a86c8dabca6b/manifests/overlays/manager/namespace.yaml
Otherwise, we can create Kubeflow Trainer resources in kubeflow-system
namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make sure that namespace is created by Kubeflow Manifests: https://github.com/kubeflow/manifests/blob/master/example/kustomization.yaml#L62
Since it contains the appropriate labels for other tools (e.g. Katib).
Thus, we should remove the namespace creation as part of Kubeflow Trainer manifests for Kubeflow Platform.
You can see example in Katib: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-with-kubeflow/kustomization.yaml#L41
Signed-off-by: Xinmin Du <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @Doris-xm!
/lgtm
/approve
cc @juliusvonkohout @kubeflow/wg-manifests-leads @akagami-harsh you can integrate Kubeflow Trainer in Kubeflow Manifests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Doris-xm Thanks for this amazing work!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich, Electronic-Waste The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cherry-pick release-2.0 |
@andreyvelich: #2382 failed to apply on top of branch "release-2.0":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@kunal-511 can you synchronize Trainer 2.0 to Kubeflow/manifests ? |
okay i will do that |
@Doris-xm @Electronic-Waste Please can you manually cherry-pick this PR to the |
ok, I will do that |
…w#2382) * Add the manifests overlay for Kubeflow Training V2 Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: adjust permissions, and format changes Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: rename overlay, adjust event permissions Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: make namespace configurable Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: move standalone, only-manager installation in namespace: kubeflow-system Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: add overlay for Kubeflow Platform installation Signed-off-by: Xinmin Du <[email protected]> * add permission for pods log read & rm persistentvolumeclaims Signed-off-by: Xinmin Du <[email protected]> * create the runtimes before the webhooks Signed-off-by: Xinmin Du <[email protected]> * Specify sorting order: fifo Signed-off-by: Xinmin Du <[email protected]> * Deploy jobset first Signed-off-by: Xinmin Du <[email protected]> * remove edit permissions to runtimes; install runtimes after crds Signed-off-by: Xinmin Du <[email protected]> * remove pretraining directory Signed-off-by: Xinmin Du <[email protected]> * patch runtimes images Signed-off-by: Xinmin Du <[email protected]> * fix: correct image Signed-off-by: Xinmin Du <[email protected]> * add image patch for more runtimes Signed-off-by: Xinmin Du <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * role_bind for notebook & profile Signed-off-by: Xinmin Du <[email protected]> * fix: reorder images Signed-off-by: Xinmin Du <[email protected]> * fix: reuse overlay/manager & runtimes Signed-off-by: Xinmin Du <[email protected]> * fix: remove namespace with patch Signed-off-by: Xinmin Du <[email protected]> --------- Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Du Xinmin <[email protected]> Co-authored-by: Xinmin Du <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]>
What this PR does / why we need it:
This PR adds the manifests overlay for Kubeflow Training V2, allowing to install it within Kubeflow Platform.
Which issue(s) this PR fixes :
Fixes #2381
Checklist: