KEP-2401: Complement torch plugin to support torchtune config mutation #2587

Merged

Electronic-Waste
Member

@Electronic-Waste Electronic-Waste commented Apr 8, 2025

What this PR does / why we need it:

This PR adds the torchtune config mutation implementation to the torch plugin.

As discussed earlier, we will implement config mutation and validation on the server side, which avoids frequent SDK changes and gives users better backward compatibility.

In detail, this PR:

  • Adds config mutation/mapping for the torchtune configs passed in .spec.trainer.args
  • Adds validations for the torchtune config
  • Adds more UTs for the torch EnforceMLPolicy and Validate functions

REF: https://github.com/kubeflow/trainer/tree/master/docs/proposals/2401-llm-trainer-v2#complement-torch-plugin

/cc @kubeflow/wg-training-leads @astefanutti @franciscojavierarceo @saileshd1402 @deepanker13 @akshaychitneni

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2507 #2508

Checklist:

  • Docs included if any changes are user facing


@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: saileshd1402.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.


@Electronic-Waste Electronic-Waste marked this pull request as draft April 8, 2025 14:16
@Electronic-Waste Electronic-Waste changed the title KEP-2401: Complement torch plugin to support torchtune config mutation [WIP] KEP-2401: Complement torch plugin to support torchtune config mutation Apr 8, 2025
@coveralls

coveralls commented Apr 9, 2025

Pull Request Test Coverage Report for Build 14692575862

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 59 of 61 (96.72%) changed or added relevant lines in 1 file are covered.
  • 62 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.7%) to 67.18%

Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/runtime/framework/plugins/torch/torch.go | 59 | 61 | 96.72% |

Files with Coverage Reduction:

| File | New Missed Lines | % |
|---|---|---|
| pkg/runtime/framework/plugins/coscheduling/coscheduling.go | 62 | 0.0% |

Totals:

| Metric | Value |
|---|---|
| Change from base Build 14341999020 | 0.7% |
| Covered Lines | 1789 |
| Relevant Lines | 2663 |

💛 - Coveralls

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Apr 9, 2025
Signed-off-by: Electronic-Waste <[email protected]>
@Electronic-Waste Electronic-Waste marked this pull request as ready for review April 9, 2025 09:39
@Electronic-Waste Electronic-Waste changed the title [WIP] KEP-2401: Complement torch plugin to support torchtune config mutation KEP-2401: Complement torch plugin to support torchtune config mutation Apr 9, 2025
@Electronic-Waste
Member Author

PTAL if you have time, thanks:)

/assign @kubeflow/wg-training-leads @astefanutti @akshaychitneni @franciscojavierarceo @deepanker13

@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Apr 22, 2025
Signed-off-by: Electronic-Waste <[email protected]>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Apr 27, 2025
@Electronic-Waste
Member Author

@andreyvelich Thanks for your detailed review! I've addressed your comments. PTAL if you have time:)

Member

@andreyvelich andreyvelich left a comment

Overall lgtm.
Just a small comment.
/lgtm
/assign @tenzen-y @kubeflow/wg-training-leads @saileshd1402 @astefanutti @franciscojavierarceo for the review.

Member

@andreyvelich andreyvelich left a comment

I think we can move forward.
Let's address any additional changes in follow-up PRs.
Thank you for this @Electronic-Waste!
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich


@google-oss-prow google-oss-prow bot merged commit 040b34e into kubeflow:master Apr 29, 2025
20 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.0 milestone Apr 29, 2025
@Electronic-Waste Electronic-Waste deleted the feat/torchtune-plugin branch April 29, 2025 13:26
@Electronic-Waste
Member Author

@andreyvelich @astefanutti Thanks for your detailed review! I'll create another issue to discuss #2587 (comment).

szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
kubeflow/trainer#2587)

* chore(plugin): Add torchtune-related constants & update current torch plugin.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(plugin): Add EnforceMLPolicy for torchtune.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(plugin): Add UTs in torch plugin.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(test): fix error in torch plugin UTs.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(plugin): Choose recipe according to numNodes & numProcPerNode & Args.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): Add PretrainedModel enum type.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(plugin): Add torchtune config arg.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(test): add UT for single-device full fine-tuning with torchtune.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(test): Add test for multi-nodes full fine-tuning with torchtune.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(test): Update torch validate UTs.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lint): fix lint error.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): remove pretrained model enum type in sdk.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(plugin): retrieve model name from runtimeRef.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lint): fix typo.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(plugin): make some adjustments according to the review.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): remove runtime in get_trainer_crd_from_builtin_trainer.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(plugin): pass PET_ env variables in torch plugin for torchtune.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(plugin): add env validation for torchtune.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(plugin): update comments.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(plugins): fix the implementation according to the review.

Signed-off-by: Electronic-Waste <[email protected]>

* test(plugins): fix UT error in torch plugin.

Signed-off-by: Electronic-Waste <[email protected]>

* fix: fix UT and e2e tests error.

Signed-off-by: Electronic-Waste <[email protected]>

* fix: remove debug info.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(test): add args in UTs related to torchtune.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(test): update torchtune related args.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(test): Add a UT for multi-node mode check in torch plugin.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 5, 2025
kubeflow/trainer#2587)

akagami-harsh pushed a commit to akagami-harsh/training-operator that referenced this pull request Jul 17, 2025
kubeflow#2587)
