
Commit 93bf10d

Katib: Add Hyperparameter Tuning Architecture (#3688)
* Katib: Add HyperParameter Tuning Architecture

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove CRD label from diagram

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
1 parent 996b465 commit 93bf10d


2 files changed: 106 additions & 91 deletions

content/en/docs/components/katib/images/katib-architecture.drawio.svg

Lines changed: 4 additions & 0 deletions

content/en/docs/components/katib/overview.md

Lines changed: 102 additions & 91 deletions
@@ -46,6 +46,103 @@ various AutoML algorithms.
alt="Katib Overview"
class="mt-3 mb-3">

This diagram shows how Katib performs hyperparameter tuning:

<img src="/docs/components/katib/images/katib-architecture.drawio.svg"
alt="Katib Overview"
class="mt-3 mb-3">

First, users write the ML training code that Katib evaluates in every Trial,
each time with a different set of hyperparameters. Then, using the Katib Python
SDK, users set the objective, search space, search algorithm, and Trial
resources, and create the Katib Experiment.

Follow the [quickstart guide](/docs/components/katib/hyperparameter/#quickstart-with-katib-sdk)
to create your first Katib Experiment.

Katib uses the following concepts to tune hyperparameters. Experiment, Suggestion,
and Trial are Katib Custom Resource Definitions (CRDs); the worker job can be any
Kubernetes resource.

### Experiment

An _Experiment_ is a single tuning run, also called an optimization run.

You specify configuration settings to define the Experiment. The following are
the main configurations:

- **Objective**: What you want to optimize. This is the objective metric, also
called the target variable. A common metric is the model's accuracy
in the validation pass of the training job (_validation-accuracy_). You also
specify whether you want the hyperparameter tuning job to _maximize_ or
_minimize_ the metric.

- **Search space**: The set of all possible hyperparameter values that the
hyperparameter tuning job should consider for optimization, and the
constraints for each hyperparameter. Other names for search space include
_feasible set_ and _solution space_. For example, you may provide the
names of the hyperparameters that you want to optimize. For each
hyperparameter, you may provide a _minimum_ and _maximum_ value or a _list_
of allowable values.

- **Search algorithm**: The algorithm to use when searching for the optimal
hyperparameter values.
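The three configurations above can be pictured as a small data model. The following is a minimal, framework-free Python sketch; the names `Objective`, `DoubleParameter`, `CategoricalParameter`, and `ExperimentSpec` are hypothetical and do not mirror the Katib SDK or the Experiment CRD schema. It only illustrates how an objective, a search space of ranges and lists, and an algorithm name fit together, and how a candidate set of values can be checked against the search space.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Hypothetical types for illustration only; not the Katib SDK or CRD schema.


@dataclass
class Objective:
    metric_name: str  # e.g. "validation-accuracy"
    goal_type: str    # "maximize" or "minimize"

    def __post_init__(self) -> None:
        if self.goal_type not in ("maximize", "minimize"):
            raise ValueError(f"unknown goal type: {self.goal_type}")


@dataclass
class DoubleParameter:
    minimum: float
    maximum: float

    def contains(self, value: object) -> bool:
        # A continuous range: any number between minimum and maximum.
        return isinstance(value, (int, float)) and self.minimum <= value <= self.maximum


@dataclass
class CategoricalParameter:
    values: List[str]

    def contains(self, value: object) -> bool:
        # A list of allowable values.
        return value in self.values


Parameter = Union[DoubleParameter, CategoricalParameter]


@dataclass
class ExperimentSpec:
    objective: Objective
    search_space: Dict[str, Parameter]
    algorithm: str  # name of the search algorithm, e.g. "random"
    max_trial_count: int = 10

    def is_feasible(self, assignment: Dict[str, object]) -> bool:
        """True if the assignment sets every hyperparameter to an allowed value."""
        return set(assignment) == set(self.search_space) and all(
            self.search_space[name].contains(value) for name, value in assignment.items()
        )


spec = ExperimentSpec(
    objective=Objective("validation-accuracy", "maximize"),
    search_space={
        "lr": DoubleParameter(0.001, 0.1),
        "optimizer": CategoricalParameter(["sgd", "adam"]),
    },
    algorithm="random",
)
print(spec.is_feasible({"lr": 0.01, "optimizer": "adam"}))  # True
print(spec.is_feasible({"lr": 0.5, "optimizer": "adam"}))   # False
```

Keeping the search space as explicit, typed parameters makes it easy to reject an assignment that falls outside the feasible set before any training runs.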

The Katib Experiment is defined as a
[Kubernetes CRD](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).

For details of how to define your Experiment, follow the guide to [running an
experiment](/docs/components/katib/experiment/).

### Suggestion

A _Suggestion_ is a set of hyperparameter values that the hyperparameter
tuning process has proposed. Katib creates a Trial to evaluate the suggested
set of values.
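To illustrate what a Suggestion contains, here is a sketch of the simplest search algorithm, random search, which draws each hyperparameter independently from its allowed range or list. The tuple-based search-space encoding and the `suggest_random` function are assumptions made for this example; Katib's own suggestion services implement many more algorithms and are not reproduced here.

```python
import random
from typing import Dict, Tuple

# Hypothetical search-space encoding for illustration:
#   ("double", min, max)       -> continuous range
#   ("categorical", [choices]) -> list of allowable values
SearchSpace = Dict[str, Tuple]


def suggest_random(search_space: SearchSpace, rng: random.Random) -> Dict[str, object]:
    """Return one Suggestion: a value for every hyperparameter, drawn uniformly."""
    suggestion: Dict[str, object] = {}
    for name, spec in search_space.items():
        kind = spec[0]
        if kind == "double":
            low, high = spec[1], spec[2]
            suggestion[name] = rng.uniform(low, high)
        elif kind == "categorical":
            suggestion[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unsupported parameter type: {kind}")
    return suggestion


space = {"lr": ("double", 0.001, 0.1), "optimizer": ("categorical", ["sgd", "adam"])}
rng = random.Random(42)
for _ in range(3):
    # Each printed dict is one Suggestion that a Trial would evaluate.
    print(suggest_random(space, rng))
```

Seeding `random.Random` makes the sequence of Suggestions reproducible, which is convenient when comparing runs.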

The Katib Suggestion is defined as a
[Kubernetes CRD](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).

### Trial

A _Trial_ is one iteration of the hyperparameter tuning process. A Trial
corresponds to one worker job instance with a list of parameter assignments.
The list of parameter assignments corresponds to a Suggestion.

Each Experiment runs several Trials. The Experiment runs the Trials until it
reaches either the objective goal or the configured maximum number of Trials.
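The stopping condition described above can be sketched as a loop. This is a simplified, sequential model under stated assumptions: the objective is maximized, one Trial runs at a time, and `suggest` and `evaluate` are hypothetical stand-ins for a Suggestion and a worker job. Katib itself runs Trials as Kubernetes resources and can run several of them in parallel.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Assignment = Dict[str, float]


def run_experiment(
    suggest: Callable[[], Assignment],
    evaluate: Callable[[Assignment], float],
    goal: Optional[float],
    max_trial_count: int,
) -> List[Tuple[Assignment, float]]:
    """Run Trials until the goal is reached or max_trial_count Trials have run.

    Hypothetical, sequential sketch for a maximized objective.
    """
    trials: List[Tuple[Assignment, float]] = []
    for _ in range(max_trial_count):
        assignment = suggest()          # one Suggestion
        value = evaluate(assignment)    # the worker job computes the objective value
        trials.append((assignment, value))
        if goal is not None and value >= goal:
            break                       # objective goal reached: stop early
    return trials


rng = random.Random(0)


def suggest() -> Assignment:
    return {"x": rng.uniform(-2.0, 2.0)}


def evaluate(a: Assignment) -> float:
    # Toy objective with its maximum (1.0) at x = 0.
    return 1.0 - a["x"] ** 2


trials = run_experiment(suggest, evaluate, goal=0.95, max_trial_count=20)
print(len(trials), max(v for _, v in trials))
```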

The Katib Trial is defined as a
[Kubernetes CRD](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).

### Worker job

The _worker job_ is the process that runs to evaluate a Trial and calculate
its objective value.
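A worker job has to make its objective value visible to Katib. One common pattern, assumed here, is for the training code to print lines such as `validation-accuracy=0.93` to standard output, which a metrics collector then parses; check the Katib metrics collector documentation for the exact formats your setup accepts. The `train` and `collect` functions below are hypothetical: `train` returns the lines it would print so the example stays testable, and its learning curve is a toy formula, not real training.

```python
import re
from typing import List

# Assumed metric line format: "<metric-name>=<value>", one per line on stdout.
METRIC_LINE = re.compile(r"([\w-]+)\s*=\s*([+-]?\d*\.?\d+(?:[eE][+-]?\d+)?)")


def train(lr: float, epochs: int) -> List[str]:
    """Hypothetical worker job: 'trains' and emits metric lines for each epoch."""
    lines = []
    accuracy = 0.5
    for epoch in range(1, epochs + 1):
        accuracy += (1.0 - accuracy) * min(lr * 10, 0.9)  # toy learning curve
        lines.append(f"epoch={epoch}")
        lines.append(f"validation-accuracy={accuracy:.4f}")
    return lines


def collect(lines: List[str], metric_name: str) -> List[float]:
    """Extract every reported value of metric_name from the worker's output."""
    values = []
    for line in lines:
        for name, value in METRIC_LINE.findall(line):
            if name == metric_name:
                values.append(float(value))
    return values


output = train(lr=0.05, epochs=3)
for line in output:
    print(line)
# Here we simply take the last reported value as the Trial's objective value.
print(collect(output, "validation-accuracy")[-1])  # 0.9375
```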

The worker job can be any type of Kubernetes resource or
[Kubernetes CRD](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).
Follow the
[Trial template guide](/docs/components/katib/trial-template/#custom-resource)
to learn how to use your own Kubernetes resource in Katib.

Upstream Katib provides examples for these worker job types:

- [Kubernetes `Job`](https://kubernetes.io/docs/concepts/workloads/controllers/job/)

- [Kubeflow `TFJob`](/docs/components/training/tftraining/)

- [Kubeflow `PyTorchJob`](/docs/components/training/pytorch/)

- [Kubeflow `MXJob`](/docs/components/training/mxnet)

- [Kubeflow `XGBoostJob`](/docs/components/training/xgboost)

- [Kubeflow `MPIJob`](/docs/components/training/mpi)

- [Tekton `Pipelines`](https://github.com/kubeflow/katib/tree/master/examples/v1beta1/tekton)

- [Argo `Workflows`](https://github.com/kubeflow/katib/tree/master/examples/v1beta1/argo)

By offering the above worker job types, Katib supports multiple ML frameworks.

## Hyperparameters and hyperparameter tuning

_Hyperparameters_ are the variables that control the model training process.
@@ -82,12 +179,12 @@ layers, and the optimizer):
_(To run the example that produced this graph, follow the [getting-started
guide](/docs/components/katib/hyperparameter/).)_

Katib runs several training jobs (known as _Trials_) within each
hyperparameter tuning job (_Experiment_). Each Trial tests a different
hyperparameter configuration. At the end of the Experiment, Katib outputs
the optimized values for the hyperparameters.
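Picking the optimized values at the end of an Experiment comes down to selecting the best completed Trial in the direction set by the objective. The sketch below assumes each Trial is recorded as a pair of parameter assignments and objective value; the `best_trial` helper and the numbers in `completed` are made up for illustration.

```python
from typing import Dict, List, Tuple

Trial = Tuple[Dict[str, object], float]  # (parameter assignments, objective value)


def best_trial(trials: List[Trial], goal_type: str) -> Trial:
    """Return the Trial with the best objective value for the given goal type."""
    if not trials:
        raise ValueError("no completed Trials")
    if goal_type == "maximize":
        return max(trials, key=lambda t: t[1])
    if goal_type == "minimize":
        return min(trials, key=lambda t: t[1])
    raise ValueError(f"unknown goal type: {goal_type}")


# Illustrative, made-up results of three completed Trials.
completed = [
    ({"lr": 0.1, "optimizer": "sgd"}, 0.81),
    ({"lr": 0.01, "optimizer": "adam"}, 0.93),
    ({"lr": 0.001, "optimizer": "adam"}, 0.88),
]
params, value = best_trial(completed, "maximize")
print(params, value)  # {'lr': 0.01, 'optimizer': 'adam'} 0.93
```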

You can make your hyperparameter tuning Experiments more efficient by using
[early stopping](https://en.wikipedia.org/wiki/Early_stopping) techniques,
which stop Trials that are unlikely to improve the objective.
Follow the [early stopping guide](/docs/components/katib/early-stopping/)
for the details.
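As one example of an early stopping technique, the median stopping rule stops a running Trial at step s when its best value so far is worse than the median of the completed Trials' running averages up to step s. Katib offers a median-stop algorithm, but this is only a simplified sketch for a maximized objective, not its implementation; `should_stop`, its `min_trials` threshold, and the sample curves are hypothetical.

```python
from statistics import median
from typing import List


def running_average(values: List[float], step: int) -> float:
    """Mean of the first `step` reported values."""
    return sum(values[:step]) / step


def should_stop(current: List[float], completed: List[List[float]], min_trials: int = 3) -> bool:
    """Median stopping rule for a maximized objective (simplified sketch).

    Stop the current Trial at step s if its best value so far is below the
    median of the completed Trials' running averages up to step s.
    """
    step = len(current)
    # Only compare against completed Trials that reported at least `step` values.
    history = [c for c in completed if len(c) >= step]
    if step == 0 or len(history) < min_trials:
        return False  # not enough evidence yet
    threshold = median(running_average(c, step) for c in history)
    return max(current) < threshold


# Made-up per-step objective values of three completed Trials.
completed_curves = [[0.60, 0.70, 0.80], [0.50, 0.65, 0.75], [0.55, 0.68, 0.78]]
print(should_stop([0.30, 0.35], completed_curves))  # True
print(should_stop([0.62, 0.70], completed_curves))  # False
```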
@@ -125,7 +222,7 @@ part of the form for submitting a NAS job from the Katib UI:

You can use the following interfaces to interact with Katib:

- A web UI that you can use to submit Experiments and to monitor your results.
Check the [getting-started
guide](/docs/components/katib/hyperparameter/#katib-ui)
for information on how to access the UI.
@@ -145,92 +242,6 @@ You can use the following interfaces to interact with Katib:

- Katib Python SDK. Check the [Katib Python SDK documentation on GitHub](https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1).

## Next steps

Follow the [getting-started guide](/docs/components/katib/hyperparameter/)
