@@ -1692,6 +1692,61 @@ _Will be added after initial implementation for PyTorch._
_Will be added after initial implementation for PyTorch._

+ ## Pipeline Framework
+
+ We introduce this framework as an internal mechanism so that the logic
+ for combining Runtimes and TrainJob can be extended easily.
+
+ The framework is called the Kubeflow Trainer Pipeline Framework. It has 4 phases, as shown in the following
+ overview.
+
+ The phases are executed one after another, except that the `Startup Phase` is executed only once,
+ when the trainer-controller-manager starts:
+
+ - `Startup Phase`: Initializes the internal components once, when the trainer-controller-manager starts.
+ - `PreExecution Phase`: Executed as part of the validating admission webhooks, which are triggered when a TrainJob is created or updated.
+ - `Build Phase`: Builds the child Kubernetes resources and deploys them to the cluster.
+ - `PostExecution Phase`: Executed after the `Build Phase`.
+
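The phase ordering above can be sketched in Go as follows. This is a minimal illustration, not the actual implementation: the `Phase` type, the `Framework` struct, and the `Start`, `Admit`, and `Reconcile` methods are hypothetical names introduced here, and the phases only record that they ran.

```go
package main

import "fmt"

// Phase names the four phases of the framework (hypothetical type).
type Phase int

const (
	Startup Phase = iota
	PreExecution
	Build
	PostExecution
)

func (p Phase) String() string {
	return [...]string{"Startup", "PreExecution", "Build", "PostExecution"}[p]
}

// Framework is a stand-in that only records which phases were executed.
type Framework struct {
	started bool
	Trace   []Phase
}

// Start runs the Startup phase; repeated calls are no-ops because the
// phase is executed only once per controller-manager process.
func (f *Framework) Start() {
	if f.started {
		return
	}
	f.started = true
	f.Trace = append(f.Trace, Startup)
}

// Admit models a validating webhook call on TrainJob create or update.
func (f *Framework) Admit() {
	f.Trace = append(f.Trace, PreExecution)
}

// Reconcile models one reconciliation: Build, then PostExecution.
func (f *Framework) Reconcile() {
	f.Trace = append(f.Trace, Build, PostExecution)
}

func main() {
	f := &Framework{}
	f.Start()
	f.Start() // second call does nothing
	f.Admit()
	f.Reconcile()
	fmt.Println(f.Trace)
}
```

In this sketch the webhook and the reconciler are plain method calls; in a real controller they would be driven by the API server and the Manager.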
+ As shown in the diagram, each phase has 2 types of APIs: `Internal API` and `Extension Point`.
+ Extension Points are exposed: operations can be added to them as plugins, within the scope of the Pipeline Framework plugin interfaces,
+ and those plugins may be executed in any order.
+ Internal APIs, by contrast, are not exposed, and no operations can be added to them.
+
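One way to picture the split between Extension Points and Internal APIs is a registry that sorts plugins by the interfaces they implement, while keeping internal steps unexported. The sketch below is an assumption-laden illustration: `Plugin`, `CustomValidationPlugin`, `ComponentBuilderPlugin`, `Framework`, and `jobSetPlugin` are hypothetical names, and Kubernetes objects are reduced to strings.

```go
package main

import (
	"errors"
	"fmt"
)

// Plugin is the minimal contract every plugin satisfies (hypothetical).
type Plugin interface {
	Name() string
}

// CustomValidationPlugin and ComponentBuilderPlugin are hypothetical
// extension-point interfaces; a plugin opts in by implementing them.
type CustomValidationPlugin interface {
	Plugin
	Validate(trainJob string) error
}

type ComponentBuilderPlugin interface {
	Plugin
	Build(trainJob string) []string
}

// Framework groups plugins by the extension points they implement.
type Framework struct {
	validators []CustomValidationPlugin
	builders   []ComponentBuilderPlugin
}

// New sorts the given plugins into extension points via type assertions.
func New(plugins ...Plugin) *Framework {
	f := &Framework{}
	for _, p := range plugins {
		if v, ok := p.(CustomValidationPlugin); ok {
			f.validators = append(f.validators, v)
		}
		if b, ok := p.(ComponentBuilderPlugin); ok {
			f.builders = append(f.builders, b)
		}
	}
	return f
}

// RunCustomValidation calls every validator and collects the failures.
// Plugins must not rely on the order in which they are called.
func (f *Framework) RunCustomValidation(trainJob string) []error {
	var errs []error
	for _, v := range f.validators {
		if err := v.Validate(trainJob); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

// RunBuild gathers objects from every builder, then passes them to the
// unexported deploy step, which plugins cannot replace or extend.
func (f *Framework) RunBuild(trainJob string) int {
	var objs []string
	for _, b := range f.builders {
		objs = append(objs, b.Build(trainJob)...)
	}
	return f.deploy(objs)
}

// deploy stands in for an Internal API; it only counts the objects.
func (f *Framework) deploy(objs []string) int {
	return len(objs)
}

// jobSetPlugin is an example plugin implementing both extension points.
type jobSetPlugin struct{}

func (jobSetPlugin) Name() string { return "JobSet" }

func (jobSetPlugin) Validate(trainJob string) error {
	if trainJob == "" {
		return errors.New("TrainJob name must not be empty")
	}
	return nil
}

func (jobSetPlugin) Build(trainJob string) []string {
	return []string{"JobSet/" + trainJob}
}

func main() {
	f := New(jobSetPlugin{})
	fmt.Println(len(f.RunCustomValidation("demo")), f.RunBuild("demo"))
}
```

Detecting extension points by type assertion lets a single plugin take part in several phases without a separate registration call for each one; whether the real framework does this is not stated in the text above.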
+ - `Startup Phase`:
+   - Internal API:
+     - `TrainJobController`: Sets up the TrainJob controller and registers it with the Manager.
+     - `Built-in Webhook Servers`: Sets up the built-in admission webhook servers and registers them with the Manager.
+     - `Start Manager`: Starts the Manager.
+   - Extension Point:
+     - `WatchExtension`: Registers arbitrary reconciler builders that watch any kind of resource
+       and trigger TrainJob reconciliations.
+ - `PreExecution Phase`:
+   - Extension Point:
+     - `CustomValidation`: Registers validators with the validating admission webhook servers. The validators can validate any kind of resource
+       when a TrainJob is created or updated.
+ - `Build Phase`:
+   - Internal API:
+     - `ComponentDeployer`: Deploys the built components (resources) to the cluster. This is performed as part of the reconciler.
+   - Extension Point:
+     - `EnforcePodGroupPolicy`: Applies PodGroup-specific parameters (e.g., those specified in TrainingRuntime `.spec.podGroupPolicy`)
+       to any kind of resource, such as PodSpec.
+     - `EnforceMLPolicy`: Applies machine learning framework-specific parameters (e.g., those specified in TrainingRuntime `.spec.mlPolicy`)
+       to any kind of resource, such as PodSpec.
+     - `ComponentBuilder`: Builds Kubernetes resources from `RuntimeInfo` and `TrainJob`.
+       `RuntimeInfo` is an abstracted object extracted from runtimes such as TrainingRuntime and ClusterTrainingRuntime.
+ - `PostExecution Phase`:
+   - Internal API:
+     - `SuspendedCondition`: Checks whether the TrainJob is suspended and, if so, adds the `Suspended` condition to the TrainJob.
+     - `CreatedCondition`: Checks whether the TrainJob has been created and, if so, adds the `Created` condition to the TrainJob.
+   - Extension Point:
+     - `TerminalCondition`: Checks whether the TrainJob has terminated and, if so, adds the `Complete` condition to the TrainJob,
+       with the terminal reason and message propagated from the child Jobs.
+
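The `PostExecution Phase` steps listed above can be sketched as a function that updates the TrainJob conditions. This is an illustration under stated assumptions, not the real code: `Condition`, `TrainJob`, `ChildJob`, `setCondition`, and `postExecution` are hypothetical simplifications, the reason strings are made up, treating "has child Jobs" as the created state is an assumption, and so is the choice of `Complete` and `Failed` as the child condition types that count as terminal.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  bool
	Reason  string
	Message string
}

// TrainJob and ChildJob are simplified, hypothetical models.
type TrainJob struct {
	Suspend    bool
	Conditions []Condition
}

type ChildJob struct {
	Conditions []Condition
}

// setCondition replaces the condition of the same type or appends a new one.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// postExecution applies the three condition checks in sequence.
func postExecution(tj *TrainJob, children []ChildJob) {
	// SuspendedCondition (Internal API).
	if tj.Suspend {
		tj.Conditions = setCondition(tj.Conditions, Condition{
			Type: "Suspended", Status: true,
			Reason: "Suspended", Message: "TrainJob is suspended",
		})
	}
	// CreatedCondition (Internal API); assumes "created" means that
	// at least one child Job exists.
	if len(children) > 0 {
		tj.Conditions = setCondition(tj.Conditions, Condition{
			Type: "Created", Status: true,
			Reason: "JobsCreated", Message: "child Jobs were created",
		})
	}
	// TerminalCondition (Extension Point): propagate the reason and
	// message of the first terminal child condition to the TrainJob.
	for _, child := range children {
		for _, c := range child.Conditions {
			if c.Status && (c.Type == "Complete" || c.Type == "Failed") {
				tj.Conditions = setCondition(tj.Conditions, Condition{
					Type: "Complete", Status: true,
					Reason: c.Reason, Message: c.Message,
				})
				return
			}
		}
	}
}

func main() {
	tj := &TrainJob{}
	children := []ChildJob{{Conditions: []Condition{{
		Type: "Failed", Status: true,
		Reason: "BackoffLimitExceeded", Message: "too many retries",
	}}}}
	postExecution(tj, children)
	for _, c := range tj.Conditions {
		fmt.Printf("%s reason=%s\n", c.Type, c.Reason)
	}
}
```

Replacing a condition of the same type, rather than appending a duplicate, keeps repeated reconciliations idempotent in this sketch.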
## Migration from Kubeflow Training V1

These API changes will not be compatible with Training Operator V1 APIs. Thus, existing users have