KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family #2590


Merged

Conversation

Electronic-Waste
Member

What this PR does / why we need it:

This PR adds the ClusterTrainingRuntimes (CTRs) for the Llama 3.2 model family, including:

  1. Llama 3.2 1B Instruct: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
  2. Llama 3.2 3B Instruct: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2510

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow google-oss-prow bot requested a review from jinchihe April 10, 2025 03:10
@google-oss-prow google-oss-prow bot requested a review from kuizhiqing April 10, 2025 03:10
@coveralls

coveralls commented Apr 10, 2025

Pull Request Test Coverage Report for Build 14922684382

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 231 unchanged lines in 7 files lost coverage.
  • Overall coverage increased (+1.2%) to 28.807%

Files with Coverage Reduction | New Missed Lines | %
pkg/runtime/framework/plugins/coscheduling/coscheduling.go | 5 | 0.0%
pkg/util/testing/compare.go | 10 | 0.0%
pkg/controller/setup.go | 16 | 0.0%
pkg/runtime/framework/plugins/mpi/mpi.go | 21 | 84.86%
pkg/runtime/framework/plugins/jobset/jobset.go | 22 | 40.3%
pkg/controller/trainjob_controller.go | 73 | 0.0%
pkg/util/testing/wrapper.go | 84 | 0.0%
Totals Coverage Status
Change from base Build 14757493082: 1.2%
Covered Lines: 840
Relevant Lines: 2916

💛 - Coveralls

@Electronic-Waste
Member Author

/cc @kubeflow/wg-training-leads @astefanutti @franciscojavierarceo

volumes:
- name: initializer
  persistentVolumeClaim:
    claimName: initializer
Contributor

Out of curiosity, is it a prerequisite to have this PVC created? Or is it meant to be overwritten somehow?

Member Author

This is to indicate that we'll use the default StorageClass to create this PVC if it does not exist :)

Contributor

Do you think it's useful to add that in a comment? Or is this documented somewhere else?

Member Author

Uhhh... It's the default processing logic for k8s. You can refer to: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1 for more information:)

Contributor

@astefanutti astefanutti Apr 10, 2025

Are you saying it's the default behavior that Kubernetes creates the PVC automatically?
I wasn't asking about the default StorageClass, I was asking about the PVC :)

Member Author

Are you saying it's the default behavior Kubernetes creates the PVC automatically?

Yeah (if PVC does not exist). Maybe I didn't get your point then.

I've asked about the PVC

Do you mean that we might reuse the PVC left by previous training tasks by accident? From my perspective, it does not matter since we have different dirs for the downloaded datasets and models (e.g. /workspace/model/Llama-3.2-1B-Instruct).

Member

I think it might be tricky without support for stateful JobSet, which should orchestrate volume claims: kubernetes-sigs/jobset#572

@Electronic-Waste Who will be responsible for managing the initializer PVC in the user's namespace?

Member Author

I think it might be tricky without support for stateful JobSet, which should orchestrate volume claims: kubernetes-sigs/jobset#572

I see... Sorry for the confusion. I assumed that JobSet would automatically create the PVC if it does not exist.

I think it might be tricky without support for stateful JobSet, which should orchestrate volume claims: kubernetes-sigs/jobset#572

Maybe we can temporarily create the initializer PVC in our SDK when using a BuiltinTrainer?

# If users choose to use a builtin trainer for post-training.
elif isinstance(trainer, types.BuiltinTrainer):
trainer_crd = utils.get_trainer_crd_from_builtin_trainer(trainer)

Do you have any other thought? @andreyvelich @tenzen-y @astefanutti

Member

@andreyvelich andreyvelich Apr 30, 2025

Ok, for now, let's create this as part of the SDK. Eventually, we should create it in the controller, since TrainJobs can be triggered via YAML as well.

We should verify that the PVC exists, in case Cluster Admins want to pre-create it for users.
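
For reference, a minimal sketch of what a pre-created initializer PVC could look like; the storage size is an assumption, and storageClassName is omitted so the cluster's default StorageClass is used (the claim name must match the claimName: initializer referenced by the runtime volumes):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: initializer
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi  # assumed size; needs to fit the downloaded model and dataset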

Member

Maybe for now, we can just create it in the torch plugin as a temporary solution.
Any concerns @tenzen-y?

@google-oss-prow google-oss-prow bot requested a review from astefanutti April 10, 2025 12:32

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kannon92.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Encountered error in CI:

Error from server (Invalid): ClusterTrainingRuntime.trainer.kubeflow.org "torchtune-llama3.2-1b" is invalid: [spec.template.spec.replicatedJobs[2].dependsOn: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Error from server (Invalid): ClusterTrainingRuntime.trainer.kubeflow.org "torchtune-llama3.2-3b" is invalid: [spec.template.spec.replicatedJobs[2].dependsOn: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

It seems that JobSet does not support multiple items in the DependsOn API.
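
For context, a minimal sketch of the workaround, where each replicatedJob declares at most one dependsOn entry and the jobs are chained dataset-initializer → model-initializer → node (this matches the final manifest shown later in this thread):

replicatedJobs:
- name: dataset-initializer
- name: model-initializer
  dependsOn:
  - name: dataset-initializer
    status: Complete
- name: node
  dependsOn:
  - name: model-initializer
    status: Complete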

/cc @kubeflow/wg-training-leads @kannon92 @astefanutti

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste
Member Author

/hold for kubernetes-sigs/jobset#874

@Electronic-Waste
Member Author

/hold cancel


echo "TorchTune Llama3.2-1B Finetuning Runtime"

echo "--------------------------------------"
pip list
Member

As we discussed before, by default these runtimes should contain the correct tune run command to kick off the TrainJob as follows:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: fine-tune-llama-3.2
spec:
  runtimeRef:
    name: torchtune-llama3.2-1b

Since you are going to override the command in torch plugin anyway, why do you need to have this command?

I think, here you should also check that the runtime belongs to torchtune: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/torch/torch.go#L183

Member Author

Since you are going to override the command in torch plugin anyway, why do you need to have this command?

Previously, we assumed that users would use the SDK to launch the training task: https://github.com/kubeflow/trainer/blob/master/docs/proposals/2401-llm-trainer-v2/README.md#propagte-torchtune-settings-with-sdk

And they may use the TrainJob you mentioned to inspect the package dependencies in our torchtune-trainer. Do you prefer also supporting launching tasks with YAML+CTR besides SDK+CTR? @andreyvelich

Member

There are no differences in the TrainJob YAML whether the user runs it from the SDK or from kubectl.
As you can see, we only add the Trainer API when the trainer is defined by the user: https://github.com/kubeflow/trainer/blob/master/sdk/kubeflow/trainer/api/trainer_client.py#L196.

So I suggest that we add the appropriate command in those Runtimes, so users can just trigger them as:

TrainerClient().train(
    runtime=Runtime(name="torchtune-llama3.3-70b"),
)

Member Author

Sure, will update it soon:)

@Electronic-Waste Electronic-Waste force-pushed the feat/llama3_2-manifests branch from e768d45 to 5baf306 on April 30, 2025 19:42
@Electronic-Waste
Member Author

/retest

@Electronic-Waste
Member Author

Electronic-Waste commented May 1, 2025

I'll add the PVC check later. Do you prefer adding the validation in this PR or in a follow-up PR? @andreyvelich

@Electronic-Waste
Member Author

/retest

Comment on lines +81 to +84
- mountPath: /workspace/dataset
  name: initializer
- mountPath: /workspace/model
  name: initializer
Member

How will we tell tune run to use the model and dataset from this location rather than downloading them when we run tune run ...?

Member Author

Will add it in this issue: #2506
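
For reference, a hedged sketch of how the trainer command could point tune run at the mounted volumes; the key=value override syntax is supported by tune run, but the exact config keys and paths below are assumptions based on torchtune's Llama 3.2 recipe configs, not the outcome of #2506:

command:
- tune
- run
- full_finetune_distributed
- --config llama3_2/1B_full.yaml
# Assumed config overrides pointing at the initializer volume mounts:
- checkpointer.checkpoint_dir=/workspace/model/Llama-3.2-1B-Instruct
- tokenizer.path=/workspace/model/Llama-3.2-1B-Instruct/original/tokenizer.model
- output_dir=/workspace/output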

Member

@andreyvelich andreyvelich left a comment

Thanks @Electronic-Waste!
/lgtm
/assign @tenzen-y @astefanutti

Member

@andreyvelich andreyvelich left a comment

image: ghcr.io/kubeflow/trainer/model-initializer
env:
- name: STORAGE_URI
  value: hf://meta-llama/Llama-3.2-1B-Instruct
Member

Just one question: I think without the token you can't download Llama models, right?
Shall we add a comment in this Runtime that the user has to insert their HF token here?

Member Author

I think we should ask users to add the HF_TOKEN env in the TrainJob YAML, or in the initializer arg of the train() API, in the user guide. They may not notice a comment in this Runtime :)
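
For reference, a minimal sketch of what that could look like on the model-initializer container; HF_TOKEN is the variable the Hugging Face libraries read, while the Secret name hf-secret and key token are hypothetical examples:

env:
- name: STORAGE_URI
  value: hf://meta-llama/Llama-3.2-1B-Instruct
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-secret  # assumed Secret created by the user
      key: token       # assumed key holding the Hugging Face token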

Member

@andreyvelich andreyvelich May 1, 2025

Sounds good, we can add it to the docs

@Electronic-Waste
Member Author

/rerun-all

@@ -5,3 +5,4 @@ resources:
- mlx_distributed.yaml
- mpi_distributed.yaml
- torch_distributed.yaml
- torchtune
Member

Given that the runtimes are not fully functional yet, can we exclude them from the base kustomization file, please?

Member Author

Sure, I'll remove it soon:)

spec:
  containers:
  - name: node
    image: ghcr.io/kubeflow/trainer/torchtune-trainer
Member

@Electronic-Waste Can we add a patch in the overlay to update the trainer image tag?
Since for the release we should use the 2.0.0 tag.

Member Author

I think maybe we need to use the latest tag instead of the 2.0.0 tag, since this PR is not ready: #2623

Member

But for the release, we have to create a patch to add the appropriate tag for our runtimes + trainers.
@Doris-xm @Electronic-Waste Do we want to do that as part of this: #2527?

Member Author

@Electronic-Waste Electronic-Waste May 9, 2025

A separate issue/PR might be better, since it's related to the release-management CI, and users may fail to pull the images if we update the tag before the v2.0 release. So my suggestion is that we use the latest tag for these PRs: #2590, #2527, #2382, and change the CI and all tags together in a new PR. WDYT? @andreyvelich @Doris-xm

Member

Sure, we can use latest for now, but can we at least add a proper overlay for the runtimes that overrides the image tag with latest, like here: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/manager/kustomization.yaml#L16-L18

Member Author

SGTM

@google-oss-prow google-oss-prow bot removed the lgtm label May 9, 2025
@@ -2,3 +2,8 @@ apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base/runtimes

# Update the Kubeflow LLM Trainer image tag.
images:
Member

@Electronic-Waste Please can you verify whether it is going to work?
I thought Kustomize could only update images for the standard k8s resources (e.g. Deployment), not CRDs.

Member Author

I'll have a try. If it does not work, I'll add a patch for this:)

Member Author

@Electronic-Waste Electronic-Waste May 9, 2025

It works for CRDs:

$ kustomize build manifests/overlays/runtimes > out.yaml
<one of CRDs>
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torchtune-llama3.2-1b
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: 2
  template:
    spec:
      replicatedJobs:
      - name: dataset-initializer
        template:
          spec:
            template:
              metadata:
                labels:
                  trainer.kubeflow.org/trainjob-ancestor-step: dataset-initializer
              spec:
                containers:
                - env:
                  - name: STORAGE_URI
                    value: hf://tatsu-lab/alpaca
                  image: ghcr.io/kubeflow/trainer/dataset-initializer
                  name: dataset-initializer
                  volumeMounts:
                  - mountPath: /workspace/dataset
                    name: initializer
                volumes:
                - name: initializer
                  persistentVolumeClaim:
                    claimName: initializer
      - dependsOn:
        - name: dataset-initializer
          status: Complete
        name: model-initializer
        template:
          metadata:
            labels:
              trainer.kubeflow.org/trainjob-ancestor-step: model-initializer
          spec:
            template:
              spec:
                containers:
                - env:
                  - name: STORAGE_URI
                    value: hf://meta-llama/Llama-3.2-1B-Instruct
                  image: ghcr.io/kubeflow/trainer/model-initializer
                  name: model-initializer
                  volumeMounts:
                  - mountPath: /workspace/model
                    name: initializer
                volumes:
                - name: initializer
                  persistentVolumeClaim:
                    claimName: initializer
      - dependsOn:
        - name: model-initializer
          status: Complete
        name: node
        template:
          metadata:
            labels:
              trainer.kubeflow.org/trainjob-ancestor-step: trainer
          spec:
            template:
              spec:
                containers:
                - command:
                  - tune
                  - run
                  - full_finetune_distributed
                  - --config llama3_2/1B_full.yaml
                  image: ghcr.io/kubeflow/trainer/torchtune-trainer:latest
                  name: node
                  resources:
                    limits:
                      nvidia.com/gpu: 2
                  volumeMounts:
                  - mountPath: /workspace/dataset
                    name: initializer
                  - mountPath: /workspace/model
                    name: initializer
                volumes:
                - name: initializer
                  persistentVolumeClaim:
                    claimName: initializer

Member Author

Maybe we should also update images for data-initializer and model-initializer:)
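
For reference, a minimal sketch of what the runtimes overlay could look like with all three image tags overridden; the newTag values follow the latest-tag decision above and are an assumption rather than the exact file merged in this PR:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base/runtimes

# Update the Kubeflow Trainer runtime image tags.
images:
- name: ghcr.io/kubeflow/trainer/torchtune-trainer
  newTag: latest
- name: ghcr.io/kubeflow/trainer/dataset-initializer
  newTag: latest
- name: ghcr.io/kubeflow/trainer/model-initializer
  newTag: latest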


@Electronic-Waste
Member Author

@andreyvelich I've updated this PR with the model-initializer and dataset-initializer overlay images. Please don't hesitate to tell me if there are any other suggestions :)

Member

@andreyvelich andreyvelich left a comment

Thank you for this contribution @Electronic-Waste!
/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label May 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit abc4749 into kubeflow:master May 10, 2025
17 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.0 milestone May 10, 2025
akagami-harsh pushed a commit to akagami-harsh/training-operator that referenced this pull request Jul 17, 2025
…beflow#2590)

* chore(manifests): add manifests for llama3_2_1B and llama3_2_3B.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(manifests): Add detailed configuration for llama3_2_1B.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(manifests): Add detailed configuration for llama3_2_3B.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(manifests): load pvc to trainer node & add initializers.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifest): fix DependsOn error in CI.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(runtime): fix launching command for CTRs.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifest): remove torchtune temporally in base/runtimes.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifest): provide torchtune-trainer image tag override in overlays.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifest): update image for data-initializer and model-initializer.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>