This is the tracking issue for the Kubeflow LLM Trainer V2, a submodule of Kubeflow Training V2: #2170
We aim to solve:
- KEP-2170: Design Trainer for the LLM Runtimes #2321
- KEP-2170: Create LLM training runtime for Llama 3.2 8B #2212
However, the LLM Trainer V2 design is complex and needs further discussion, so we decided to open a separate issue to track it.
- Create the KEP: KEP-2401: Kubeflow LLM Trainer V2 #2410
- KEP-2401: Refactor current `train()` API #2503
- KEP-2401: Add `TorchTuneConfig` to `train()` API #2504
- KEP-2401: Support LoRA/QLoRA/DoRA fine-tuning in LLM Trainer V2 #2505
- KEP-2401: Support mutating dataset preprocessing config in SDK #2506
- KEP-2401: Complement `torch` plugin to support `torchtune` config mutation #2507
- KEP-2401: Validate fine-tuning configurations in `torch` plugin #2508
- KEP-2401: Create LLM Training Runtimes for Llama 3.1 model family #2509
- KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family #2510
- KEP-2401: Create LLM Training Runtimes for Llama 3.3 model family #2591
- KEP-2401: Create `torchtune` trainer image #2511
- KEP-2401: Determine the tag for torchtune trainer & Add support for multiple accelerators #2518
- KEP-2401: Revisit PVC claim in torchtune CTRs when stateful jobset is ready #2630
- KEP-2401: Revisit DependsOn API in CTRs When Supporting Multiple Ancestor #2592
- KEP-2401: Support loading local LLMs #2641
- Create model exporter for checkpointing and training output #2245
Examples & User Documentation
- KEP-2401: Add Notebook examples for LLM Trainer V2 #2676
- trainer: Add user guide for BuiltinTrainer(TorchTune LLM Trainer) website#4146
KEP Updates:
- Runtime API: feat(doc): add Runtime API design in KEP-2401. #2501
- Train API: fix(doc): Update `train()` API in KEP-2401 #2536
Jobset Improvements:
Initial Design (Google Doc): Kubeflow Training V2 LLM Trainer Design
/area runtime
/cc @kubeflow/wg-training-leads @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @astefanutti @varshaprasad96 @franciscojavierarceo @thesuperzapper @rimolive @juliusvonkohout @jbottum @varodrig @Doris-xm @truc0