 Currently available metrics to monitor are listed below.

-### Metrics for Each Component Container for TF operator
+### Metrics for Each Component Container for TFJob

 Component Containers:
-* tf-operator
-* tf-chief
-* tf-ps
-* tf-worker
+
+- tf-operator
+- tf-chief
+- tf-ps
+- tf-worker

 #### Each Container Reports on its:

 Use prometheus graph to run the following example commands to visualize metrics.

-*Note*: These metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet integration which reports to Prometheus through our prometheus-operator installation. You may see a complete list of metrics available in `\metrics` page of your Prometheus web UI which you can further use to compose your own queries.
+_Note_: These metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet integration which reports to Prometheus through our prometheus-operator installation. You may see a complete list of metrics available in `\metrics` page of your Prometheus web UI which you can further use to compose your own queries.

 **CPU usage**
+
 ```
 sum (rate (container_cpu_usage_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```

 **GPU Usage**
+
 ```
 sum (rate (container_accelerator_memory_used_bytes{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```

 **Memory Usage**
+
 ```
 sum (rate (container_memory_usage_bytes{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```

 **Network Usage**
+
 ```
 sum (rate (container_network_transmit_bytes_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```

 **I/O Usage**
+
 ```
 sum (rate (container_fs_write_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```

-**Keep-Alive check**
+**Keep-Alive check**
+
 ```
 up
 ```
+
 This is maintained by Prometheus on its own with its `up` metric detailed in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series).

 **Is Leader check**
+
 ```
 tf_operator_is_leader
 ```

-*Note*: Replace `tfjob-name` with your own TF Job name you want to monitor for the example queries above.
+_Note_: Replace `tfjob-name` with your own TF Job name you want to monitor for the example queries above.

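The `tfjob-name` substitution described in the note can also be scripted. The following is a minimal sketch, not part of this change: it fills a job name into three of the example queries and builds a URL for Prometheus's instant-query endpoint (`/api/v1/query`). The helper names, the job name `mnist`, and the base URL `http://localhost:9090` are hypothetical.

```python
from urllib.parse import urlencode

# Query templates copied from the examples above; "{job}" stands in for
# "tfjob-name". Literal PromQL braces are doubled for str.format().
QUERY_TEMPLATES = {
    "cpu": 'sum (rate (container_cpu_usage_seconds_total{{pod_name=~"{job}-.*"}}[1m])) by (pod_name)',
    "network": 'sum (rate (container_network_transmit_bytes_total{{pod_name=~"{job}-.*"}}[1m])) by (pod_name)',
    "io": 'sum (rate (container_fs_write_seconds_total{{pod_name=~"{job}-.*"}}[1m])) by (pod_name)',
}


def build_query(kind: str, job_name: str) -> str:
    """Return the PromQL text for one metric kind and one TFJob name."""
    return QUERY_TEMPLATES[kind].format(job=job_name)


def build_query_url(base_url: str, kind: str, job_name: str) -> str:
    """Return an instant-query URL; base_url is an assumed Prometheus address."""
    params = urlencode({"query": build_query(kind, job_name)})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


if __name__ == "__main__":
    # Hypothetical job name and Prometheus address.
    print(build_query("cpu", "mnist"))
    print(build_query_url("http://localhost:9090", "cpu", "mnist"))
```

The sketch assumes the job name contains no regex metacharacters, since it is placed directly inside the `=~` matcher.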
 ### Report TFJob metrics:

-*Note*: If you are using release v1 tf-operator, these TFJob metrics don't have suffix `total`. So you have to use metric name like `tf_operator_jobs_created` to get your metrics. See [PR](https://github.com/kubeflow/training-operator/pull/1055) to get more information.
+_Note_: If you are using release v1 tf-operator, these TFJob metrics don't have suffix `total`. So you have to use metric name like `tf_operator_jobs_created` to get your metrics. See [PR](https://github.com/kubeflow/training-operator/pull/1055) to get more information.
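For dashboards or scripts that must work against both naming schemes, the renaming in the note can be expressed as a small helper. This is a sketch under one assumption: that the newer name is the v1 name plus a `_total` suffix (for example `tf_operator_jobs_created_total`), which is inferred from the note rather than confirmed here. The function name is hypothetical.

```python
def v1_metric_name(name: str) -> str:
    """Drop a trailing "_total" suffix, as the note describes for release v1.

    Names without the suffix are returned unchanged.
    """
    suffix = "_total"
    return name[: -len(suffix)] if name.endswith(suffix) else name


if __name__ == "__main__":
    # Assumed newer-style name; the v1 form is the one quoted in the note.
    print(v1_metric_name("tf_operator_jobs_created_total"))
```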
docs/quick-start-v1.md (+26 -25)
@@ -1,6 +1,7 @@
 # Testing v1

-Tf-operator is currently in v1. The quick start shows an example of v1 of TF operator. For more details please refer to [developer_guide.md](../developer_guide.md).
+TFJob is currently in v1. The quick start shows an example of TFJob.
+For more details please refer to [developer_guide.md](../developer_guide.md).

 ## Create a TFJob

@@ -38,12 +39,12 @@ spec:
 creationTimestamp: null
 spec:
 containers:
-- image: kubeflow/tf-dist-mnist-test:1.0
-name: tensorflow
-ports:
-- containerPort: 2222
-name: tfjob-port
-resources: {}
+- image: kubeflow/tf-dist-mnist-test:1.0
+name: tensorflow
+ports:
+- containerPort: 2222
+name: tfjob-port
+resources: {}
 Worker:
 replicas: 4
 restartPolicy: Never
@@ -52,26 +53,26 @@ spec:
 creationTimestamp: null
 spec:
 containers:
-- image: kubeflow/tf-dist-mnist-test:1.0
-name: tensorflow
-ports:
-- containerPort: 2222
-name: tfjob-port
-resources: {}
+- image: kubeflow/tf-dist-mnist-test:1.0
+name: tensorflow
+ports:
+- containerPort: 2222
+name: tfjob-port
+resources: {}
 status:
 conditions:
-- lastTransitionTime: 2019-03-06T09:50:36Z
-lastUpdateTime: 2019-03-06T09:50:36Z
-message: TFJob dist-mnist-for-e2e-test is created.
-reason: TFJobCreated
-status: "True"
-type: Created
-- lastTransitionTime: 2019-03-06T09:50:57Z
-lastUpdateTime: 2019-03-06T09:50:57Z
-message: TFJob dist-mnist-for-e2e-test is running.
-reason: TFJobRunning
-status: "True"
-type: Running
+- lastTransitionTime: 2019-03-06T09:50:36Z
+lastUpdateTime: 2019-03-06T09:50:36Z
+message: TFJob dist-mnist-for-e2e-test is created.
+reason: TFJobCreated
+status: "True"
+type: Created
+- lastTransitionTime: 2019-03-06T09:50:57Z
+lastUpdateTime: 2019-03-06T09:50:57Z
+message: TFJob dist-mnist-for-e2e-test is running.
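The `status.conditions` entries in this hunk show a TFJob moving from `Created` to `Running`. As a rough illustration of how a reader might interpret such a list, the sketch below picks the most recent condition whose `status` is `"True"`. The condition values are copied from the hunk; the helper name and the dictionary layout are assumptions for illustration, not an API of the operator.

```python
from datetime import datetime

# Conditions copied from the status block in this hunk.
conditions = [
    {
        "type": "Created",
        "status": "True",
        "reason": "TFJobCreated",
        "lastTransitionTime": "2019-03-06T09:50:36Z",
    },
    {
        "type": "Running",
        "status": "True",
        "reason": "TFJobRunning",
        "lastTransitionTime": "2019-03-06T09:50:57Z",
    },
]


def latest_condition(conds):
    """Return the most recently transitioned condition with status "True", or None."""
    active = [c for c in conds if c["status"] == "True"]
    if not active:
        return None
    return max(
        active,
        key=lambda c: datetime.strptime(c["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"),
    )


if __name__ == "__main__":
    print(latest_condition(conditions)["type"])
```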