KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family #2590


Merged

Conversation

Electronic-Waste
Member

What this PR does / why we need it:

This PR adds the ClusterTrainingRuntimes (CTRs) for the Llama 3.2 model family, including:

  1. Llama 3.2 1B Instruct: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
  2. Llama 3.2 3B Instruct: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2510

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow google-oss-prow bot requested a review from jinchihe April 10, 2025 03:10
@google-oss-prow google-oss-prow bot requested a review from kuizhiqing April 10, 2025 03:10
@coveralls

coveralls commented Apr 10, 2025

Pull Request Test Coverage Report for Build 14922684382

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 231 unchanged lines in 7 files lost coverage.
  • Overall coverage increased (+1.2%) to 28.807%

Files with Coverage Reduction | New Missed Lines | %
pkg/runtime/framework/plugins/coscheduling/coscheduling.go | 5 | 0.0%
pkg/util/testing/compare.go | 10 | 0.0%
pkg/controller/setup.go | 16 | 0.0%
pkg/runtime/framework/plugins/mpi/mpi.go | 21 | 84.86%
pkg/runtime/framework/plugins/jobset/jobset.go | 22 | 40.3%
pkg/controller/trainjob_controller.go | 73 | 0.0%
pkg/util/testing/wrapper.go | 84 | 0.0%
Totals Coverage Status
Change from base Build 14757493082: 1.2%
Covered Lines: 840
Relevant Lines: 2916

💛 - Coveralls

@Electronic-Waste
Member Author

/cc @kubeflow/wg-training-leads @astefanutti @franciscojavierarceo

volumes:
- name: initializer
  persistentVolumeClaim:
    claimName: initializer
Contributor

Out of curiosity, is it a prerequisite to have this PVC created? Or is it meant to be overwritten somehow?

Member Author

This is to indicate that we'll use the default StorageClass to create this PVC if it does not exist :)

Contributor

Do you think it's useful to add that in a comment? Or is this documented somewhere else?

Member Author

Uhhh... It's the default processing logic for k8s. You can refer to: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1 for more information:)

Contributor

@astefanutti astefanutti Apr 10, 2025

Are you saying it's the default behavior that Kubernetes creates the PVC automatically?
I wasn't asking about the default StorageClass, I was asking about the PVC :)

Member Author

Are you saying it's the default behavior Kubernetes creates the PVC automatically?

Yeah (if PVC does not exist). Maybe I didn't get your point then.

I've asked about the PVC

Do you mean that we might reuse the PVC left by previous training tasks by accident? From my perspective, it does not matter since we have different dirs for the downloaded datasets and models (e.g. /workspace/model/Llama-3.2-1B-Instruct).

Member

I think it might be tricky without support for stateful JobSet, which should orchestrate volume claims: kubernetes-sigs/jobset#572

@Electronic-Waste Who will be responsible for managing the initializer PVC in the user's namespace?

Member Author

I think it might be tricky without support for stateful JobSet, which should orchestrate volume claims: kubernetes-sigs/jobset#572

I see... Sorry for the confusion. I assumed that JobSet would automatically create the PVC if it does not exist.

I think it might be tricky without support for stateful JobSet, which should orchestrate volume claims: kubernetes-sigs/jobset#572

Maybe we can temporarily create the initializer PVC in our SDK when using a BuiltinTrainer?

# If users choose to use a builtin trainer for post-training.
elif isinstance(trainer, types.BuiltinTrainer):
trainer_crd = utils.get_trainer_crd_from_builtin_trainer(trainer)

Do you have any other thought? @andreyvelich @tenzen-y @astefanutti

Member

@andreyvelich andreyvelich Apr 30, 2025

Ok, for now, let's create this as part of the SDK. Eventually, we should create it in the controller, since TrainJobs can be triggered via YAML as well.

We should verify that the PVC exists, in case Cluster Admins want to pre-create it for users.
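
For reference, a minimal sketch of what a pre-created initializer PVC could look like; the storage size is an assumption, and storageClassName is omitted so the cluster's default StorageClass is used (the claim name must match the claimName: initializer referenced by the runtime volumes):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: initializer
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi  # assumed size; needs to fit the downloaded model and dataset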

Member

Maybe for now, we can just create it in the torch plugin as a temporary solution.
Any concerns @tenzen-y?

@google-oss-prow google-oss-prow bot requested a review from astefanutti April 10, 2025 12:32

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kannon92.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Encountered error in CI:

Error from server (Invalid): ClusterTrainingRuntime.trainer.kubeflow.org "torchtune-llama3.2-1b" is invalid: [spec.template.spec.replicatedJobs[2].dependsOn: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Error from server (Invalid): ClusterTrainingRuntime.trainer.kubeflow.org "torchtune-llama3.2-3b" is invalid: [spec.template.spec.replicatedJobs[2].dependsOn: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

It seems that JobSet does not support multiple items in the DependsOn API.
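
For context, a minimal sketch of the workaround, where each replicatedJob declares at most one dependsOn entry and the jobs are chained dataset-initializer → model-initializer → node (this matches the final manifest shown later in this thread):

replicatedJobs:
- name: dataset-initializer
- name: model-initializer
  dependsOn:
  - name: dataset-initializer
    status: Complete
- name: node
  dependsOn:
  - name: model-initializer
    status: Complete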

/cc @kubeflow/wg-training-leads @kannon92 @astefanutti

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste
Member Author

/hold for kubernetes-sigs/jobset#874

@Electronic-Waste
Member Author

/hold cancel


echo "TorchTune Llama3.2-1B Finetuning Runtime"

echo "--------------------------------------"
pip list
Member

As we discussed before, by default these runtimes should contain the correct tune run command to kick off the TrainJob as follows:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: fine-tune-llama-3.2
spec:
  runtimeRef:
    name: torchtune-llama3.2-1b

Since you are going to override the command in torch plugin anyway, why do you need to have this command?

I think, here you should also check that the runtime belongs to torchtune: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/torch/torch.go#L183

Member Author

Since you are going to override the command in torch plugin anyway, why do you need to have this command?

Previously, we assumed that users would use the SDK to launch the training task: https://github.com/kubeflow/trainer/blob/master/docs/proposals/2401-llm-trainer-v2/README.md#propagte-torchtune-settings-with-sdk

And they may use the TrainJob you mentioned to inspect the package dependencies in our torchtune-trainer. Do you prefer also supporting launching tasks with YAML+CTR besides SDK+CTR? @andreyvelich

Member

There are no differences in the TrainJob YAML whether the user runs it from the SDK or from kubectl.
As you can see, we only add the Trainer API when the trainer is defined by the user: https://github.com/kubeflow/trainer/blob/master/sdk/kubeflow/trainer/api/trainer_client.py#L196.

So I suggest that we add the appropriate command in those Runtimes, so users can just trigger them as:

TrainerClient().train(
    runtime=Runtime(name="torchtune-llama3.3-70b"),
)

Member Author

Sure, will update it soon:)

@Electronic-Waste Electronic-Waste force-pushed the feat/llama3_2-manifests branch from e768d45 to 5baf306 on April 30, 2025 19:42
@Electronic-Waste
Member Author

/retest

@Electronic-Waste
Member Author

Electronic-Waste commented May 1, 2025

I'll add the PVC check later. Do you prefer adding the validation in this PR or in a follow-up PR? @andreyvelich

@Electronic-Waste
Member Author

/retest

Comment on lines +81 to +84
- mountPath: /workspace/dataset
  name: initializer
- mountPath: /workspace/model
  name: initializer
Member

How will we tell tune run to use the model and dataset from this location rather than downloading them when we run tune run ...?

Member Author

Will add it in this issue: #2506
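
For reference, a hedged sketch of how the trainer command could point tune run at the mounted volumes; the key=value override syntax is supported by tune run, but the exact config keys and paths below are assumptions based on torchtune's Llama 3.2 recipe configs, not the outcome of #2506:

command:
- tune
- run
- full_finetune_distributed
- --config llama3_2/1B_full.yaml
# Assumed config overrides pointing at the initializer volume mounts:
- checkpointer.checkpoint_dir=/workspace/model/Llama-3.2-1B-Instruct
- tokenizer.path=/workspace/model/Llama-3.2-1B-Instruct/original/tokenizer.model
- output_dir=/workspace/output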

Member

@andreyvelich andreyvelich left a comment

Thanks @Electronic-Waste!
/lgtm
/assign @tenzen-y @astefanutti

Member

@andreyvelich andreyvelich left a comment

image: ghcr.io/kubeflow/trainer/model-initializer
env:
- name: STORAGE_URI
  value: hf://meta-llama/Llama-3.2-1B-Instruct
Member

Just one question: I think without the token you can't download Llama models, right?
Shall we add a comment in this Runtime that the user has to insert their HF token here?

Member Author

I think we should ask users to add the HF_TOKEN env in the TrainJob YAML, or in the initializer arg of the train() API, in the user guide. They may not notice a comment in this Runtime :)
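
For reference, a minimal sketch of what that could look like on the model-initializer container; HF_TOKEN is the variable the Hugging Face libraries read, while the Secret name hf-secret and key token are hypothetical examples:

env:
- name: STORAGE_URI
  value: hf://meta-llama/Llama-3.2-1B-Instruct
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-secret  # assumed Secret created by the user
      key: token       # assumed key holding the Hugging Face token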

Member

@andreyvelich andreyvelich May 1, 2025

Sounds good, we can add it to the docs

@Electronic-Waste
Member Author

/rerun-all

@@ -5,3 +5,4 @@ resources:
- mlx_distributed.yaml
- mpi_distributed.yaml
- torch_distributed.yaml
- torchtune
Member

Given that the runtimes are not fully functional yet, can we exclude them from the base kustomization file, please?

Member Author

Sure, I'll remove it soon:)

spec:
  containers:
  - name: node
    image: ghcr.io/kubeflow/trainer/torchtune-trainer
Member

@Electronic-Waste Can we add a patch in the overlay to update the trainer image tag?
Since for the release we should use the 2.0.0 tag.

Member Author

I think maybe we need to use the latest tag instead of the 2.0.0 tag, since this PR is not ready: #2623

Member

But for the release, we have to create a patch to add the appropriate tag for our runtimes + trainers.
@Doris-xm @Electronic-Waste Do we want to do that as part of this: #2527?

Member Author

@Electronic-Waste Electronic-Waste May 9, 2025

A separate issue/PR might be better, since it's related to the release-management CI, and users may fail to pull the images if we update the tag before the v2.0 release. So my suggestion is that we use the latest tag for these PRs: #2590, #2527, #2382, and change the CI and all tags together in a new PR. WDYT? @andreyvelich @Doris-xm

Member

Sure, we can use latest for now, but can we at least add a proper overlay for the runtimes that overrides the image tag with latest, like here: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/manager/kustomization.yaml#L16-L18

Member Author

SGTM

@google-oss-prow google-oss-prow bot removed the lgtm label May 9, 2025
@@ -2,3 +2,8 @@ apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base/runtimes

# Update the Kubeflow LLM Trainer image tag.
images:
Member

@Electronic-Waste Please can you verify whether it is going to work?
I thought Kustomize could only update images for the standard k8s resources (e.g. Deployment), not CRDs.

Member Author

I'll have a try. If it does not work, I'll add a patch for this:)

Member Author

@Electronic-Waste Electronic-Waste May 9, 2025

It works for CRDs:

$ kustomize build manifests/overlays/runtimes > out.yaml
<one of CRDs>
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torchtune-llama3.2-1b
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: 2
  template:
    spec:
      replicatedJobs:
      - name: dataset-initializer
        template:
          spec:
            template:
              metadata:
                labels:
                  trainer.kubeflow.org/trainjob-ancestor-step: dataset-initializer
              spec:
                containers:
                - env:
                  - name: STORAGE_URI
                    value: hf://tatsu-lab/alpaca
                  image: ghcr.io/kubeflow/trainer/dataset-initializer
                  name: dataset-initializer
                  volumeMounts:
                  - mountPath: /workspace/dataset
                    name: initializer
                volumes:
                - name: initializer
                  persistentVolumeClaim:
                    claimName: initializer
      - dependsOn:
        - name: dataset-initializer
          status: Complete
        name: model-initializer
        template:
          metadata:
            labels:
              trainer.kubeflow.org/trainjob-ancestor-step: model-initializer
          spec:
            template:
              spec:
                containers:
                - env:
                  - name: STORAGE_URI
                    value: hf://meta-llama/Llama-3.2-1B-Instruct
                  image: ghcr.io/kubeflow/trainer/model-initializer
                  name: model-initializer
                  volumeMounts:
                  - mountPath: /workspace/model
                    name: initializer
                volumes:
                - name: initializer
                  persistentVolumeClaim:
                    claimName: initializer
      - dependsOn:
        - name: model-initializer
          status: Complete
        name: node
        template:
          metadata:
            labels:
              trainer.kubeflow.org/trainjob-ancestor-step: trainer
          spec:
            template:
              spec:
                containers:
                - command:
                  - tune
                  - run
                  - full_finetune_distributed
                  - --config llama3_2/1B_full.yaml
                  image: ghcr.io/kubeflow/trainer/torchtune-trainer:latest
                  name: node
                  resources:
                    limits:
                      nvidia.com/gpu: 2
                  volumeMounts:
                  - mountPath: /workspace/dataset
                    name: initializer
                  - mountPath: /workspace/model
                    name: initializer
                volumes:
                - name: initializer
                  persistentVolumeClaim:
                    claimName: initializer

Member Author

Maybe we should also update images for data-initializer and model-initializer:)
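
For reference, a minimal sketch of what the runtimes overlay could look like with all three image tags overridden; the newTag values follow the latest-tag decision above and are an assumption rather than the exact file merged in this PR:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base/runtimes

# Update the Kubeflow Trainer runtime image tags.
images:
- name: ghcr.io/kubeflow/trainer/torchtune-trainer
  newTag: latest
- name: ghcr.io/kubeflow/trainer/dataset-initializer
  newTag: latest
- name: ghcr.io/kubeflow/trainer/model-initializer
  newTag: latest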


@Electronic-Waste
Member Author

@andreyvelich I've updated this PR with the model-initializer and dataset-initializer overlay images. Please don't hesitate to tell me if there are any other suggestions :)

Member

@andreyvelich andreyvelich left a comment

Thank you for this contribution @Electronic-Waste!
/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label May 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit abc4749 into kubeflow:master May 10, 2025
17 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.0 milestone May 10, 2025
akagami-harsh pushed a commit to akagami-harsh/training-operator that referenced this pull request Jul 17, 2025
…beflow#2590)

* chore(manifests): add manifests for llama3_2_1B and llama3_2_3B.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(manifests): Add detailed configuration for llama3_2_1B.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(manifests): Add detailed configuration for llama3_2_3B.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(manifests): load pvc to trainer node & add initializers.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifest): fix DependsOn error in CI.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(runtime): fix launching command for CTRs.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifest): remove torchtune temporally in base/runtimes.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifest): provide torchtune-trainer image tag override in overlays.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifest): update image for data-initializer and model-initializer.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>