KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family #2590
Conversation
Pull Request Test Coverage Report for Build 14922684382
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
/cc @kubeflow/wg-training-leads @astefanutti @franciscojavierarceo
volumes:
- name: initializer
  persistentVolumeClaim:
    claimName: initializer
Out of curiosity, is it a prerequisite to have this PVC created? Or is it meant to be overwritten somehow?
This is to indicate that we'll use the default StorageClass to create this PVC if it does not exist :)
Do you think it's useful to add that in a comment? Or is this documented somewhere else?
Uhhh... It's the default processing logic for Kubernetes. You can refer to https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1 for more information :)
Are you saying it's the default behavior that Kubernetes creates the PVC automatically?
I wasn't asking about the default StorageClass, I was asking about the PVC :)
Are you saying it's the default behavior that Kubernetes creates the PVC automatically?
Yeah (if PVC does not exist). Maybe I didn't get your point then.
I've asked about the PVC
Do you mean that we might reuse a PVC left by previous training tasks by accident? From my perspective, it does not matter, since we have different dirs for the downloaded datasets and models (e.g. /workspace/model/Llama-3.2-1B-Instruct).
I think it might be tricky without support for stateful JobSet, which should orchestrate volume claims: kubernetes-sigs/jobset#572
@Electronic-Waste Who will be responsible for managing the initializer PVC in the user's namespace?
I think, it might be tricky without support of stateful JobSet which should orchestrate volume claims: kubernetes-sigs/jobset#572
I see... Sorry for the confusion. I had assumed that JobSet would automatically create the PVC if it does not exist.
I think, it might be tricky without support of stateful JobSet which should orchestrate volume claims: kubernetes-sigs/jobset#572
Maybe we can temporarily create the initializer PVC in our SDK when using BuiltinTrainer?
trainer/sdk/kubeflow/trainer/api/trainer_client.py
Lines 194 to 197 in 08232da
# If users choose to use a builtin trainer for post-training.
elif isinstance(trainer, types.BuiltinTrainer):
    trainer_crd = utils.get_trainer_crd_from_builtin_trainer(trainer)
Do you have any other thoughts? @andreyvelich @tenzen-y @astefanutti
Ok, for now, let's create this as part of the SDK.
We should create it in the controller, since TrainJobs can be triggered via YAML as well.
We should verify whether the PVC already exists, in case Cluster Admins want to pre-create it for users.
Maybe for now we can just create it in the torch plugin as a temporary solution.
Any concerns @tenzen-y ?
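For reference, a minimal sketch of the initializer PVC that the SDK, the torch plugin, or a cluster admin would have to create. The size and access mode are illustrative assumptions (not values from this PR), and leaving storageClassName unset lets the cluster's default StorageClass do the provisioning:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: initializer
spec:
  # storageClassName is intentionally omitted so the default StorageClass is used.
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi  # illustrative size, not taken from this PR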
@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kannon92. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/hold for kubernetes-sigs/jobset#874
/hold cancel
echo "TorchTune Llama3.2-1B Finetuning Runtime" | ||
|
||
echo "--------------------------------------" | ||
pip list |
As we discussed before, by default these runtimes should contain the correct tune run command to kick off the TrainJob as follows:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: fine-tune-llama-3.2
spec:
  runtimeRef:
    name: torchtune-llama3.2-1b

Since you are going to override the command in the torch plugin anyway, why do you need to have this command?
I think here you should also check that the runtime belongs to torchtune: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/torch/torch.go#L183
Since you are going to override the command in torch plugin anyway, why do you need to have this command?
Previously, we assumed that users would use the SDK to launch the training task: https://github.com/kubeflow/trainer/blob/master/docs/proposals/2401-llm-trainer-v2/README.md#propagte-torchtune-settings-with-sdk
And they may use the TrainJob you mentioned to inspect the package dependencies in our torchtune-trainer. Do you prefer also supporting launching tasks with YAML+CTR besides SDK+CTR? @andreyvelich
There are no differences in the TrainJob YAML when the user runs it from the SDK or from kubectl.
As you can see, we only add the Trainer API when the trainer is defined by the user: https://github.com/kubeflow/trainer/blob/master/sdk/kubeflow/trainer/api/trainer_client.py#L196.
So I suggest that we add the appropriate command in those Runtimes, so the user can just trigger them as:

TrainerClient().train(
    runtime=Runtime(name="torchtune-llama3.3-70b"),
)
Sure, will update it soon :)
/retest
I'll add the PVC check later. Do you prefer adding the validation in this PR or in a follow-up PR? @andreyvelich
/retest
- mountPath: /workspace/dataset
  name: initializer
- mountPath: /workspace/model
  name: initializer
How will we tell tune run to use the model and dataset from this location rather than downloading them when we run tune run ...?
Will add it in this issue: #2506
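As a rough illustration of what that might look like: torchtune accepts key=value config overrides after --config, so the runtime's command could be extended roughly as below. The exact keys and the /workspace/output path are assumptions for illustration, not something this PR or #2506 has settled on:

command:
- tune
- run
- full_finetune_distributed
- --config llama3_2/1B_full.yaml
# Hypothetical overrides pointing torchtune at the pre-downloaded artifacts
# instead of letting it download them again:
- checkpointer.checkpoint_dir=/workspace/model/Llama-3.2-1B-Instruct
- tokenizer.path=/workspace/model/Llama-3.2-1B-Instruct/original/tokenizer.model
- output_dir=/workspace/output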
Thanks @Electronic-Waste!
/lgtm
/assign @tenzen-y @astefanutti
image: ghcr.io/kubeflow/trainer/model-initializer
env:
- name: STORAGE_URI
  value: hf://meta-llama/Llama-3.2-1B-Instruct
Just one question: I think you can't download Llama models without a token, right?
Shall we add a comment in this Runtime that the user has to insert an HF token here?
I think we should ask users, in the user guide, to add the HF_TOKEN env in the TrainJob YAML or via the initializer arg of the train() API. They may not notice the comment in this Runtime :)
Sounds good, we can add it to the docs
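For the docs, a sketch of how that could look in the TrainJob YAML. The initializer field names below reflect one reading of the current API and may differ, and hf-secret is a hypothetical Secret the user would create beforehand:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: fine-tune-llama-3.2
spec:
  runtimeRef:
    name: torchtune-llama3.2-1b
  initializer:
    model:
      env:
      # The goal is only that HF_TOKEN reaches the model initializer
      # so the gated Llama weights can be pulled.
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret  # hypothetical user-created Secret holding the token
            key: token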
/rerun-all
@@ -5,3 +5,4 @@ resources:
 - mlx_distributed.yaml
 - mpi_distributed.yaml
 - torch_distributed.yaml
+- torchtune
Given that the runtimes are not fully functional yet, can we exclude them from the base kustomization file, please?
Sure, I'll remove it soon :)
spec:
  containers:
  - name: node
    image: ghcr.io/kubeflow/trainer/torchtune-trainer
@Electronic-Waste Can we add a patch in the overlay to update the trainer image tag? Since for the release we should use the 2.0.0 tag.
I think maybe we need to use the latest tag instead of the 2.0.0 tag, since this PR is not ready: #2623
But for the release, we have to create a patch to add the appropriate tag for our runtimes + trainers.
@Doris-xm @Electronic-Waste Do we want to do that as part of this: #2527?
A separate issue/PR might be better, since it's related to release management CI. And users may fail to pull the images if we update them before the v2.0 release. So, my suggestion is that we use the latest tag for these PRs: #2590, #2527, #2382, and change the CI and all tags together in a new PR. WDYT? @andreyvelich @Doris-xm
Sure, we can use latest for now, but can we at least add a proper overlay for the runtimes that overrides the image tag with latest, like here: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/manager/kustomization.yaml#L16-L18
SGTM
@@ -2,3 +2,8 @@ apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
 resources:
 - ../../base/runtimes
+
+# Update the Kubeflow LLM Trainer image tag.
+images:
@Electronic-Waste Can you please verify if it is going to work?
I thought Kustomize can only update images for the standard k8s resources (e.g. Deployment), not CRDs.
I'll give it a try. If it does not work, I'll add a patch for this :)
It works for CRD:
$ kustomize build manifests/overlays/runtimes > out.yaml
<one of CRDs>
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torchtune-llama3.2-1b
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: 2
  template:
    spec:
      replicatedJobs:
      - name: dataset-initializer
        template:
          spec:
            template:
              metadata:
                labels:
                  trainer.kubeflow.org/trainjob-ancestor-step: dataset-initializer
              spec:
                containers:
                - env:
                  - name: STORAGE_URI
                    value: hf://tatsu-lab/alpaca
                  image: ghcr.io/kubeflow/trainer/dataset-initializer
                  name: dataset-initializer
                  volumeMounts:
                  - mountPath: /workspace/dataset
                    name: initializer
                volumes:
                - name: initializer
                  persistentVolumeClaim:
                    claimName: initializer
      - dependsOn:
        - name: dataset-initializer
          status: Complete
        name: model-initializer
        template:
          metadata:
            labels:
              trainer.kubeflow.org/trainjob-ancestor-step: model-initializer
          spec:
            template:
              spec:
                containers:
                - env:
                  - name: STORAGE_URI
                    value: hf://meta-llama/Llama-3.2-1B-Instruct
                  image: ghcr.io/kubeflow/trainer/model-initializer
                  name: model-initializer
                  volumeMounts:
                  - mountPath: /workspace/model
                    name: initializer
                volumes:
                - name: initializer
                  persistentVolumeClaim:
                    claimName: initializer
      - dependsOn:
        - name: model-initializer
          status: Complete
        name: node
        template:
          metadata:
            labels:
              trainer.kubeflow.org/trainjob-ancestor-step: trainer
          spec:
            template:
              spec:
                containers:
                - command:
                  - tune
                  - run
                  - full_finetune_distributed
                  - --config llama3_2/1B_full.yaml
                  image: ghcr.io/kubeflow/trainer/torchtune-trainer:latest
                  name: node
                  resources:
                    limits:
                      nvidia.com/gpu: 2
                  volumeMounts:
                  - mountPath: /workspace/dataset
                    name: initializer
                  - mountPath: /workspace/model
                    name: initializer
                volumes:
                - name: initializer
                  persistentVolumeClaim:
                    claimName: initializer
Maybe we should also update the images for data-initializer and model-initializer :)
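A sketch of how the overlay's images list could cover all three images. The image names are taken from the runtime output above, and pinning everything to latest simply mirrors the interim plan discussed here, nothing final:

images:
- name: ghcr.io/kubeflow/trainer/torchtune-trainer
  newTag: latest
- name: ghcr.io/kubeflow/trainer/dataset-initializer
  newTag: latest
- name: ghcr.io/kubeflow/trainer/model-initializer
  newTag: latest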
@andreyvelich I've updated this PR with the image overrides for the initializers.
Thank you for this contribution @Electronic-Waste!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andreyvelich. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
…beflow#2590)
* chore(manifests): add manifests for llama3_2_1B and llama3_2_3B.
* chore(manifests): Add detailed configuration for llama3_2_1B.
* chore(manifests): Add detailed configuration for llama3_2_3B.
* chore(manifests): load pvc to trainer node & add initializers.
* fix(manifest): fix DependsOn error in CI.
* fix(runtime): fix launching command for CTRs.
* fix(manifest): remove torchtune temporally in base/runtimes.
* fix(manifest): provide torchtune-trainer image tag override in overlays.
* fix(manifest): update image for data-initializer and model-initializer.
---------
Signed-off-by: Electronic-Waste <[email protected]>
What this PR does / why we need it:
This PR adds the CTRs (ClusterTrainingRuntimes) for the Llama 3.2 model family, including the torchtune runtimes for Llama-3.2-1B and Llama-3.2-3B.
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Fixes #2510