Skip to content

KEP-2170: Add the manifests overlay for Kubeflow Training V2 #2382

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Jun 30, 2025

Conversation

Doris-xm
Copy link
Contributor

@Doris-xm Doris-xm commented Jan 9, 2025

What this PR does / why we need it:

This PR adds the manifests overlay for Kubeflow Training V2, allowing to install it within Kubeflow Platform.

Which issue(s) this PR fixes :
Fixes #2381

Checklist:

  • Docs included if any changes are user facing

@Electronic-Waste
Copy link
Member

Will review it later.

/cc @kubeflow/wg-training-leads @kubeflow/release-team @saileshd1402
/ok-to-test
/rerun-all

@google-oss-prow google-oss-prow bot requested a review from a team January 9, 2025 15:25
Copy link

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/release-team, saileshd1402.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Will review it later.

/cc @kubeflow/wg-training-leads @kubeflow/release-team @saileshd1402
/ok-to-test
/rerun-all

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this @Doris-xm! I left some initial comments for you.

Btw, I recommend that you could learn more about the concept of Training V2. This will help you update the manifests overlay in training-operator and kubeflow/manifests:)

FYI: KubeCon 2024 NA Talk by Andrey and Yuki

@Electronic-Waste
Copy link
Member

/rerun-all

@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Jan 10, 2025
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM. Thank you for your great contributions!

/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this great contribution @willb!
/assign @kubeflow/wg-manifests-leads @kubeflow/release-team

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM!

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Xinmin Du added 5 commits January 13, 2025 15:16
Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>
…ace: kubeflow-system

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>
@Doris-xm Doris-xm force-pushed the add-overlay-manifest-v2 branch from 9518f7b to 1f1b0c2 Compare January 13, 2025 07:17
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for updating this:)

/lgtm
/assign @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads @saileshd1402

@google-oss-prow google-oss-prow bot added the lgtm label Jan 13, 2025
@andreyvelich andreyvelich changed the title Add the manifests overlay for Kubeflow Training V2 KEP-2170: Add the manifests overlay for Kubeflow Training V2 Jan 14, 2025
@Electronic-Waste
Copy link
Member

Maybe we could try to fix it in recent days. Hey @Doris-xm, do you get a chance to finish this PR?

@Doris-xm
Copy link
Contributor Author

Doris-xm commented Jun 7, 2025

I’ll give it a try and do my best to get it done within a week. @Electronic-Waste
cc @andreyvelich

@juliusvonkohout
Copy link
Member

just remember it does not have to be perfect. We can merge and iterate on it.

@Electronic-Waste
Copy link
Member

We've closed the standalone manifests PR due to the reason listed in #2527 (comment).

Different from standalone installtion, Kubeflow Platform will try to install al components until they succeed. So it's okay to install the crds and runtimes together in a single kustomization file.

@Doris-xm Please can you update this PR so that we can merge it and unblock the release?

REF: #2527 (comment)

Comment on lines 8 to 11
- name: ghcr.io/kubeflow/trainer/deepspeed-runtime
newTag: latest
- name: ghcr.io/kubeflow/trainer/mlx-runtime
newTag: latest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please can you change the order of these images, so we will be in-sync with release-2.0 branch ?
https://github.com/kubeflow/trainer/blob/32a474c8329e2608cab2152ba4f57ee54a749782/manifests/overlays/runtimes/kustomization.yaml

images:
  - name: ghcr.io/kubeflow/trainer/torchtune-trainer
    newTag: v2.0.0-rc.0
  - name: ghcr.io/kubeflow/trainer/dataset-initializer
    newTag: v2.0.0-rc.0
  - name: ghcr.io/kubeflow/trainer/model-initializer
    newTag: v2.0.0-rc.0
  - name: ghcr.io/kubeflow/trainer/mlx-runtime
    newTag: v2.0.0-rc.0
  - name: ghcr.io/kubeflow/trainer/deepspeed-runtime
    newTag: v2.0.0-rc.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need change the image tag from latest to v2.0.0-rc.0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in the master branch, we should keep them v2.0.0-rc.0 on the release-2.0 branch.

kind: Kustomization
namespace: kubeflow
sortOptions:
order: fifo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed, right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fifo works in a Kubernetes cluster.
I conducted the experiment and found that after removing the fifo flag, the following error occurred:

Error from server (InternalError): Internal error occurred: failed calling webhook "validator.clustertrainingruntime.trainer.kubeflow.org": failed to call webhook: Post "https://kubeflow-trainer-controller-manager.kubeflow.svc:443/validate-trainer-kubeflow-org-v1alpha1-clustertrainingruntime?timeout=10s": no endpoints available for service "kubeflow-trainer-controller-manager"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fine. As we discussed, in Kubeflow manifests we re-deploy resources until it will succeed

Comment on lines 7 to 12
- ../../third-party/jobset # Comment this line if JobSet is installed on the Kubernetes cluster.
- ../../base/crds
- ../../base/runtimes
- ../../base/manager
- ../../base/rbac
- ../../base/webhook
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just re-use the manager and runtimes overlays here and create patch to remove the namespace.yaml?

Signed-off-by: Xinmin Du <[email protected]>
@Doris-xm Doris-xm force-pushed the add-overlay-manifest-v2 branch from f24c3ce to c3a0e14 Compare June 30, 2025 11:09
@@ -0,0 +1,7 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might need to create patch to remove this namespace:https://github.com/kubeflow/trainer/blob/7335204c3db0611672d5c82848a9a86c8dabca6b/manifests/overlays/manager/namespace.yaml

Otherwise, we can create Kubeflow Trainer resources in kubeflow-system namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my experiment, the namespace: kubeflow in line 3 can overwrite the kubeflow-system namespace.

2025-06-30 19 27 41

So do we still need complex patch strategies, such as:

patches:
  - target:
      kind: Namespace
      name: .*
    patch: |-
      - op: replace
        path: /metadata/name
        value: kubeflow

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make sure that namespace is created by Kubeflow Manifests: https://github.com/kubeflow/manifests/blob/master/example/kustomization.yaml#L62
Since it contains the appropriate labels for other tools (e.g. Katib).
Thus, we should remove the namespace creation as part of Kubeflow Trainer manifests for Kubeflow Platform.

You can see example in Katib: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-with-kubeflow/kustomization.yaml#L41

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @Doris-xm!
/lgtm
/approve
cc @juliusvonkohout @kubeflow/wg-manifests-leads @akagami-harsh you can integrate Kubeflow Trainer in Kubeflow Manifests

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Doris-xm Thanks for this amazing work!
/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, Electronic-Waste

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Electronic-Waste,andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit d3ada62 into kubeflow:master Jun 30, 2025
17 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.0 milestone Jun 30, 2025
@andreyvelich
Copy link
Member

/cherry-pick release-2.0

@google-oss-robot
Copy link

@andreyvelich: #2382 failed to apply on top of branch "release-2.0":

Applying: Add the manifests overlay for Kubeflow Training V2
Applying: Update manifest: adjust permissions, and format changes
Applying: Update manifest: rename overlay, adjust event permissions
Applying: Update manifest: make namespace configurable
Using index info to reconstruct a base tree...
A	manifests/v2/base/manager/kustomization.yaml
A	manifests/v2/base/rbac/kustomization.yaml
A	manifests/v2/base/webhook/kustomization.yaml
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): manifests/v2/base/rbac/kustomization.yaml deleted in HEAD and modified in Update manifest: make namespace configurable. Version Update manifest: make namespace configurable of manifests/v2/base/rbac/kustomization.yaml left in tree.
CONFLICT (modify/delete): manifests/v2/base/manager/kustomization.yaml deleted in HEAD and modified in Update manifest: make namespace configurable. Version Update manifest: make namespace configurable of manifests/v2/base/manager/kustomization.yaml left in tree.
Auto-merging manifests/base/webhook/kustomization.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0004 Update manifest: make namespace configurable
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-2.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@juliusvonkohout
Copy link
Member

@kunal-511 can you synchronize Trainer 2.0 to Kubeflow/manifests ?

@kunal-511
Copy link

@kunal-511 can you synchronize Trainer 2.0 to Kubeflow/manifests ?

okay i will do that

@andreyvelich
Copy link
Member

@Doris-xm @Electronic-Waste Please can you manually cherry-pick this PR to the release-2.0 branch to resolve conflicts ?

@Doris-xm
Copy link
Contributor Author

@Doris-xm @Electronic-Waste Please can you manually cherry-pick this PR to the release-2.0 branch to resolve conflicts ?

ok, I will do that

akagami-harsh pushed a commit to akagami-harsh/training-operator that referenced this pull request Jul 17, 2025
…w#2382)

* Add the manifests overlay for Kubeflow Training V2

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>

* Update manifest: adjust permissions, and format changes

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>

* Update manifest: rename overlay, adjust event permissions

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>

* Update manifest: make namespace configurable

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>

* Update manifest: move standalone, only-manager installation in namespace: kubeflow-system

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>

* Update manifest: add overlay for Kubeflow Platform installation

Signed-off-by: Xinmin Du <[email protected]>

* add permission for pods log read & rm persistentvolumeclaims

Signed-off-by: Xinmin Du <[email protected]>

* create the runtimes before the webhooks

Signed-off-by: Xinmin Du <[email protected]>

* Specify sorting order: fifo

Signed-off-by: Xinmin Du <[email protected]>

* Deploy jobset first

Signed-off-by: Xinmin Du <[email protected]>

* remove edit permissions to runtimes; install runtimes after crds

Signed-off-by: Xinmin Du <[email protected]>

* remove pretraining directory

Signed-off-by: Xinmin Du <[email protected]>

* patch runtimes images

Signed-off-by: Xinmin Du <[email protected]>

* fix: correct image

Signed-off-by: Xinmin Du <[email protected]>

* add image patch for more runtimes

Signed-off-by: Xinmin Du <[email protected]>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>

* role_bind for notebook & profile

Signed-off-by: Xinmin Du <[email protected]>

* fix: reorder images

Signed-off-by: Xinmin Du <[email protected]>

* fix: reuse overlay/manager & runtimes

Signed-off-by: Xinmin Du <[email protected]>

* fix: remove namespace with patch

Signed-off-by: Xinmin Du <[email protected]>

---------

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>
Co-authored-by: Xinmin Du <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KEP-2170: Add Installation for Kubeflow Platform
9 participants