Commit 8a60919

KEP-2170: Add Kubeflow Trainer Pipeline Framework Design
Signed-off-by: Yuki Iwai <[email protected]>
1 parent e7e35d1 commit 8a60919

3 files changed, +63 -0 lines changed

docs/proposals/2170-kubeflow-training-v2/README.md

Lines changed: 55 additions & 0 deletions
@@ -1692,6 +1692,61 @@ _Will be added after initial implementation for PyTorch._

_Will be added after initial implementation for PyTorch._

## Pipeline Framework

We introduce the framework as an internal mechanism so that we can easily extend the mechanism
that combines Runtimes and TrainJob.

The framework is called the Kubeflow Trainer Pipeline Framework, and it has 4 phases, as shown in the following
overview.

![Overview](./TrainerPipelineFrameworkOverview.drawio.svg)

As described below, the phases are executed one after another, except that the `Startup Phase` is executed only once,
when the trainer-controller-manager starts:

- `Startup Phase`: Initializes internal components once, when the trainer-controller-manager starts.
- `PreExecution Phase`: This phase is executed as part of the validating admission webhooks, which are triggered when a TrainJob is created or updated.
- `Build Phase`: This phase is executed to build child Kubernetes resources and deploy them to the cluster.
- `PostExecution Phase`: This phase is executed after the `Build Phase`.

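The phase ordering described above can be sketched roughly as follows. This is only an illustration, not the trainer-controller-manager code: the `Pipeline` type, its methods, and the phase constants are hypothetical names. According to the list above, the `PreExecution Phase` runs in the admission webhook while the later phases follow it; the sketch collapses them into a single call purely to show the order, and to show that the `Startup Phase` runs once.

```go
package main

import "fmt"

// Phase names one of the four phases described in this proposal.
type Phase string

const (
	StartupPhase       Phase = "Startup"
	PreExecutionPhase  Phase = "PreExecution"
	BuildPhase         Phase = "Build"
	PostExecutionPhase Phase = "PostExecution"
)

// Pipeline is a hypothetical driver that only records which phases ran.
type Pipeline struct {
	started bool
	Log     []Phase
}

// Start runs the Startup Phase. Repeated calls are no-ops, mirroring
// "executed only once when the trainer-controller-manager starts".
func (p *Pipeline) Start() {
	if p.started {
		return
	}
	p.started = true
	p.Log = append(p.Log, StartupPhase)
}

// HandleTrainJob records the per-TrainJob phases in order.
// Assumption: webhook and reconciler steps are collapsed into one call here.
func (p *Pipeline) HandleTrainJob(name string) error {
	if !p.started {
		return fmt.Errorf("pipeline not started; cannot handle TrainJob %q", name)
	}
	p.Log = append(p.Log, PreExecutionPhase, BuildPhase, PostExecutionPhase)
	return nil
}

func main() {
	p := &Pipeline{}
	p.Start()
	p.Start() // no-op: the Startup Phase already ran
	for _, job := range []string{"job-a", "job-b"} {
		if err := p.HandleTrainJob(job); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println(p.Log)
}
```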
As shown in the diagram, each phase has 2 types of APIs: `Internal API` and `Extension Point`.
The Extension Points are exposed: operations can be added to them as plugins, within the scope of the Pipeline Framework Plugins Interfaces,
and those plugins may be executed in any order.
The Internal APIs, on the other hand, are not exposed, so no operations can be added to them.

![Kubeflow TrainerPipelineFramework](./TrainerPipelineFramework.drawio.svg)

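To make the difference between an Extension Point and an Internal API more concrete, here is a minimal sketch using the `CustomValidation` extension point as the example. All types and method names (`Framework`, `Register`, `CustomValidationPlugin`, the two validators) are assumptions made for illustration; the actual plugin interfaces may look different. The plugin list is unexported, so callers can only add plugins through `Register`, and because no validator depends on another, the registration order changes only the order of the messages.

```go
package main

import "fmt"

// TrainJob is a simplified, hypothetical stand-in for the real TrainJob object.
type TrainJob struct {
	Name     string
	NumNodes int
}

// CustomValidationPlugin is an assumed shape for a plugin attached to the
// CustomValidation extension point; the real interface may differ.
type CustomValidationPlugin interface {
	Name() string
	Validate(job TrainJob) []string
}

// Framework keeps its plugin list unexported: plugins are added through
// Register, but how the framework iterates over them is not exposed.
type Framework struct {
	validators []CustomValidationPlugin
}

// Register attaches a plugin to the CustomValidation extension point.
func (f *Framework) Register(p CustomValidationPlugin) {
	f.validators = append(f.validators, p)
}

// RunCustomValidation calls every registered validator and collects all
// messages, prefixing each with the name of the plugin that produced it.
func (f *Framework) RunCustomValidation(job TrainJob) []string {
	var msgs []string
	for _, v := range f.validators {
		for _, m := range v.Validate(job) {
			msgs = append(msgs, v.Name()+": "+m)
		}
	}
	return msgs
}

// nameValidator is an example plugin that rejects an empty name.
type nameValidator struct{}

func (nameValidator) Name() string { return "name" }

func (nameValidator) Validate(job TrainJob) []string {
	if job.Name == "" {
		return []string{"TrainJob name must not be empty"}
	}
	return nil
}

// nodesValidator is an example plugin that rejects a non-positive node count.
type nodesValidator struct{}

func (nodesValidator) Name() string { return "nodes" }

func (nodesValidator) Validate(job TrainJob) []string {
	if job.NumNodes < 1 {
		return []string{"number of nodes must be at least 1"}
	}
	return nil
}

func main() {
	f := &Framework{}
	f.Register(nodesValidator{})
	f.Register(nameValidator{})
	for _, m := range f.RunCustomValidation(TrainJob{Name: "", NumNodes: 0}) {
		fmt.Println(m)
	}
}
```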
- `Startup Phase`:
  - Internal API:
    - `TrainJobController`: Sets up the TrainJob controller and registers it with the Manager.
    - `Built-in Webhook Servers`: Sets up the built-in Admission Webhook Servers and registers them with the Manager.
    - `Start Manager`: Starts the Manager.
  - Extension Point:
    - `WatchExtension`: This registers arbitrary reconciler builders for watching any kind of resources
      and triggering TrainJob reconciliations.
- `PreExecution Phase`:
  - Extension Point:
    - `CustomValidation`: This registers validators for validating any kind of resources with the Admission Validating Webhook Servers
      when a TrainJob is created or updated.
- `Build Phase`:
  - Internal API:
    - `ComponentDeployer`: This deploys the built components (resources) to the cluster, and is performed as part of the reconciler.
  - Extension Point:
    - `EnforcePodGroupPolicy`: This applies PodGroup-specific parameters (e.g., those specified in TrainingRuntime `.spec.podGroupPolicy`)
      to any kind of resources, such as PodSpec.
    - `EnforceMLPolicy`: This applies machine learning framework-specific parameters (e.g., those specified in TrainingRuntime `.spec.mlPolicy`)
      to any kind of resources, such as PodSpec.
    - `ComponentBuilder`: This builds Kubernetes resources leveraging `RuntimeInfo` and `TrainJob`.
      `RuntimeInfo` is an abstracted object extracted from runtimes such as TrainingRuntime and ClusterTrainingRuntime.
- `PostExecution Phase`:
  - Internal API:
    - `SuspendedCondition`: Checks whether the TrainJob is in the suspended state, and if so adds the `Suspended` condition to the TrainJob.
    - `CreatedCondition`: Checks whether the TrainJob is in the created state, and if so adds the `Created` condition to the TrainJob.
  - Extension Point:
    - `TerminalCondition`: Checks whether the TrainJob is in a terminated state, and if so adds the `Complete` condition to the TrainJob,
      with the terminal reason and message propagated from the child Jobs.

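As a rough illustration of how the `Build Phase` extension points listed above might fit together, the following sketch chains `EnforceMLPolicy`, `EnforcePodGroupPolicy`, and `ComponentBuilder` plugins over a shared `RuntimeInfo`, then hands the result to a stand-in for the internal `ComponentDeployer`. Everything here is an assumption made for the example: the interface signatures, the reduced `RuntimeInfo` fields, the `NUM_NODES` variable, the `example.com/min-member` annotation, and the order in which the extension points are called. Built components are represented as plain strings rather than Kubernetes objects.

```go
package main

import (
	"fmt"
	"strconv"
)

// TrainJob is a simplified, hypothetical stand-in for the real TrainJob object.
type TrainJob struct {
	Name     string
	NumNodes int
}

// RuntimeInfo is a heavily reduced, hypothetical version of the abstracted
// object extracted from TrainingRuntime or ClusterTrainingRuntime.
type RuntimeInfo struct {
	Env         map[string]string
	Annotations map[string]string
}

// The three interfaces below are assumed shapes for the Build Phase extension points.
type EnforceMLPolicyPlugin interface {
	EnforceMLPolicy(info *RuntimeInfo, job TrainJob) error
}

type EnforcePodGroupPolicyPlugin interface {
	EnforcePodGroupPolicy(info *RuntimeInfo, job TrainJob) error
}

type ComponentBuilderPlugin interface {
	Build(info *RuntimeInfo, job TrainJob) ([]string, error)
}

// BuildPhaseRunner holds the plugins registered for the Build Phase.
type BuildPhaseRunner struct {
	MLPolicies       []EnforceMLPolicyPlugin
	PodGroupPolicies []EnforcePodGroupPolicyPlugin
	Builders         []ComponentBuilderPlugin
}

// Run lets the policy plugins adjust RuntimeInfo, then collects the components
// produced by every builder. The call order is an assumption of this sketch.
func (r BuildPhaseRunner) Run(info *RuntimeInfo, job TrainJob) ([]string, error) {
	for _, p := range r.MLPolicies {
		if err := p.EnforceMLPolicy(info, job); err != nil {
			return nil, err
		}
	}
	for _, p := range r.PodGroupPolicies {
		if err := p.EnforcePodGroupPolicy(info, job); err != nil {
			return nil, err
		}
	}
	var components []string
	for _, b := range r.Builders {
		built, err := b.Build(info, job)
		if err != nil {
			return nil, err
		}
		components = append(components, built...)
	}
	return components, nil
}

// exampleML records the node count as a made-up NUM_NODES variable.
type exampleML struct{}

func (exampleML) EnforceMLPolicy(info *RuntimeInfo, job TrainJob) error {
	info.Env["NUM_NODES"] = strconv.Itoa(job.NumNodes)
	return nil
}

// examplePodGroup records a made-up gang-scheduling annotation.
type examplePodGroup struct{}

func (examplePodGroup) EnforcePodGroupPolicy(info *RuntimeInfo, job TrainJob) error {
	info.Annotations["example.com/min-member"] = strconv.Itoa(job.NumNodes)
	return nil
}

// exampleBuilder turns RuntimeInfo and TrainJob into one described component.
type exampleBuilder struct{}

func (exampleBuilder) Build(info *RuntimeInfo, job TrainJob) ([]string, error) {
	return []string{fmt.Sprintf("Job/%s (nodes=%s)", job.Name, info.Env["NUM_NODES"])}, nil
}

// deploy stands in for the internal ComponentDeployer; it only prints.
func deploy(components []string) {
	for _, c := range components {
		fmt.Println("apply", c)
	}
}

func main() {
	info := &RuntimeInfo{Env: map[string]string{}, Annotations: map[string]string{}}
	runner := BuildPhaseRunner{
		MLPolicies:       []EnforceMLPolicyPlugin{exampleML{}},
		PodGroupPolicies: []EnforcePodGroupPolicyPlugin{examplePodGroup{}},
		Builders:         []ComponentBuilderPlugin{exampleBuilder{}},
	}
	components, err := runner.Run(info, TrainJob{Name: "demo", NumNodes: 2})
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	deploy(components)
}
```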
## Migration from Kubeflow Training V1

These API changes will not be compatible with Training Operator V1 APIs. Thus, existing users have

docs/proposals/2170-kubeflow-training-v2/TrainerPipelineFramework.drawio.svg

Lines changed: 4 additions & 0 deletions

docs/proposals/2170-kubeflow-training-v2/TrainerPipelineFrameworkOverview.drawio.svg

Lines changed: 4 additions & 0 deletions
