Commit 835eb4d

Fix legacy docs
1 parent fdc86d1 commit 835eb4d

4 files changed, +70 -48 lines changed

docs/monitoring/README.md

+25 -10
@@ -1,91 +1,106 @@
-# Prometheus Monitoring for TF operator
+# Prometheus Monitoring for TFJob

 ## Available Metrics

 Currently available metrics to monitor are listed below.

-### Metrics for Each Component Container for TF operator
+### Metrics for Each Component Container for TFJob

 Component Containers:
-* tf-operator
-* tf-chief
-* tf-ps
-* tf-worker
+
+- tf-operator
+- tf-chief
+- tf-ps
+- tf-worker

 #### Each Container Reports on its:

 Use prometheus graph to run the following example commands to visualize metrics.

-*Note*: These metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet integration which reports to Prometheus through our prometheus-operator installation. You may see a complete list of metrics available in `\metrics` page of your Prometheus web UI which you can further use to compose your own queries.
+_Note_: These metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet integration which reports to Prometheus through our prometheus-operator installation. You may see a complete list of metrics available in `\metrics` page of your Prometheus web UI which you can further use to compose your own queries.

 **CPU usage**
+
 ```
 sum (rate (container_cpu_usage_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```

 **GPU Usage**
+
 ```
 sum (rate (container_accelerator_memory_used_bytes{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```

 **Memory Usage**
+
 ```
 sum (rate (container_memory_usage_bytes{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```

 **Network Usage**
+
 ```
 sum (rate (container_network_transmit_bytes_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```

 **I/O Usage**
+
 ```
 sum (rate (container_fs_write_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
 ```

-**Keep-Alive check**
+**Keep-Alive check**
+
 ```
 up
 ```
+
 This is maintained by Prometheus on its own with its `up` metric detailed in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series).

 **Is Leader check**
+
 ```
 tf_operator_is_leader
 ```

-*Note*: Replace `tfjob-name` with your own TF Job name you want to monitor for the example queries above.
+_Note_: Replace `tfjob-name` with your own TF Job name you want to monitor for the example queries above.

 ### Report TFJob metrics:

-*Note*: If you are using release v1 tf-operator, these TFJob metrics don't have suffix `total`. So you have to use metric name like `tf_operator_jobs_created` to get your metrics. See [PR](https://github.com/kubeflow/training-operator/pull/1055) to get more information.
+_Note_: If you are using release v1 tf-operator, these TFJob metrics don't have suffix `total`. So you have to use metric name like `tf_operator_jobs_created` to get your metrics. See [PR](https://github.com/kubeflow/training-operator/pull/1055) to get more information.

 **Job Creation**
+
 ```
 tf_operator_jobs_created_total
 ```

 **Job Creation**
+
 ```
 sum (rate (tf_operator_jobs_created_total[60m]))
 ```

 **Job Deletion**
+
 ```
 tf_operator_jobs_deleted_total
 ```

 **Successful Job Completions**
+
 ```
 tf_operator_jobs_successful_total
 ```

 **Failed Jobs**
+
 ```
 tf_operator_jobs_failed_total
 ```

 **Restarted Jobs**
+
 ```
 tf_operator_jobs_restarted_total
 ```
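
Review note: the per-pod queries in this README differ only in the metric name and the TFJob name, and the job counters differ only in the event name and the optional `_total` suffix. A minimal Python sketch of how the query strings could be generated instead of edited by hand is shown below. The function names `pod_usage_query` and `job_counter_name` are hypothetical and not part of the repository; the metric names and the query shape are taken from the README text above.

```python
# Hypothetical helpers (not part of training-operator) that rebuild the
# example PromQL strings from docs/monitoring/README.md.


def pod_usage_query(metric: str, tfjob_name: str, window: str = "1m") -> str:
    """Return the per-pod rate query from the README for one TFJob."""
    # Doubled braces emit literal { } around the PromQL label matcher.
    return (
        f'sum (rate ({metric}{{pod_name=~"{tfjob_name}-.*"}}[{window}])) '
        "by (pod_name)"
    )


def job_counter_name(event: str, release_v1: bool = False) -> str:
    """Return the TFJob counter name for an event such as 'created'.

    Per the README note, the release v1 tf-operator exposes these
    counters without the `total` suffix.
    """
    base = f"tf_operator_jobs_{event}"
    return base if release_v1 else base + "_total"


if __name__ == "__main__":
    for metric in (
        "container_cpu_usage_seconds_total",
        "container_memory_usage_bytes",
    ):
        print(pod_usage_query(metric, "tfjob-name"))
    print(job_counter_name("created"))
    print(job_counter_name("created", release_v1=True))
```

Passing your own job name in place of `tfjob-name` mirrors the substitution the README note asks for.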

docs/quick-start-v1.md

+26 -25
@@ -1,6 +1,7 @@
 # Testing v1

-Tf-operator is currently in v1. The quick start shows an example of v1 of TF operator. For more details please refer to [developer_guide.md](../developer_guide.md).
+TFJob is currently in v1. The quick start shows an example of TFJob.
+For more details please refer to [developer_guide.md](../developer_guide.md).

 ## Create a TFJob

@@ -38,12 +39,12 @@ spec:
 creationTimestamp: null
 spec:
 containers:
-- image: kubeflow/tf-dist-mnist-test:1.0
-name: tensorflow
-ports:
-- containerPort: 2222
-name: tfjob-port
-resources: {}
+- image: kubeflow/tf-dist-mnist-test:1.0
+name: tensorflow
+ports:
+- containerPort: 2222
+name: tfjob-port
+resources: {}
 Worker:
 replicas: 4
 restartPolicy: Never
@@ -52,26 +53,26 @@ spec:
 creationTimestamp: null
 spec:
 containers:
-- image: kubeflow/tf-dist-mnist-test:1.0
-name: tensorflow
-ports:
-- containerPort: 2222
-name: tfjob-port
-resources: {}
+- image: kubeflow/tf-dist-mnist-test:1.0
+name: tensorflow
+ports:
+- containerPort: 2222
+name: tfjob-port
+resources: {}
 status:
 conditions:
-- lastTransitionTime: 2019-03-06T09:50:36Z
-lastUpdateTime: 2019-03-06T09:50:36Z
-message: TFJob dist-mnist-for-e2e-test is created.
-reason: TFJobCreated
-status: "True"
-type: Created
-- lastTransitionTime: 2019-03-06T09:50:57Z
-lastUpdateTime: 2019-03-06T09:50:57Z
-message: TFJob dist-mnist-for-e2e-test is running.
-reason: TFJobRunning
-status: "True"
-type: Running
+- lastTransitionTime: 2019-03-06T09:50:36Z
+lastUpdateTime: 2019-03-06T09:50:36Z
+message: TFJob dist-mnist-for-e2e-test is created.
+reason: TFJobCreated
+status: "True"
+type: Created
+- lastTransitionTime: 2019-03-06T09:50:57Z
+lastUpdateTime: 2019-03-06T09:50:57Z
+message: TFJob dist-mnist-for-e2e-test is running.
+reason: TFJobRunning
+status: "True"
+type: Running
 replicaStatuses:
 PS:
 active: 2

docs/testing/e2e_testing.md

+17 -11
@@ -1,18 +1,18 @@
-# How to Write an E2E Test for TF Operator
+# How to Write an E2E Test for Kubeflow Training Operator

-The E2E tests for TF operator are implemented as Argo workflows. For more background and details
+The E2E tests for Kubeflow Training operator are implemented as Argo workflows. For more background and details
 about Argo (not required for understanding the rest of this document), please take a look at
 [this link](https://github.com/kubeflow/testing/blob/master/README.md).

 Test results can be monitored at the [Prow dashboard](https://prow.k8s.io/?repo=kubeflow%2Ftraining-operator).

 At a high level, the E2E test suites are structured as Python test classes. Each test class contains
 one or more tests. A test typically runs the following:
-* Create a ksonnet component using a TFJob spec;
-* Creates the specified TFJob;
-* Verifies some expected results (e.g. number of pods started, job status);
-* Deletes the TFJob.

+- Create a ksonnet component using a TFJob spec;
+- Creates the specified TFJob;
+- Verifies some expected results (e.g. number of pods started, job status);
+- Deletes the TFJob.

 ## Adding a Test Method

@@ -23,11 +23,12 @@ starting or deleting a TFJob), and performs verifications of expected results (e
 correct status, pods are deleted, etc).

 Test classes should follow this pattern:
+
 ```python
 class MyTest(test_util.TestCase):
   def __init__(self, args):
     # Initialize environment
-
+
   def test_case_1(self):
     # Test code

@@ -40,17 +41,18 @@ if __name__ == "__main__"

 The code here ideally should only contain API calls. Any common functionalities used by the test code should
 be added to one of the helper modules:
-* k8s_util - for K8s operations like querying/deleting a pod
-* ks_util - for ksonnet operations
-* tf_job_client - for TFJob-specific operations, such as waiting for the job to be in a certain phase
+
+- k8s_util - for K8s operations like querying/deleting a pod
+- ks_util - for ksonnet operations
+- tf_job_client - for TFJob-specific operations, such as waiting for the job to be in a certain phase

 ## Adding a TFJob Spec

 This is needed if you want to use your own TFJob spec instead of an existing one. An example can be found
 [here](https://github.com/kubeflow/training-operator/tree/master/test/workflows/components/simple_tfjob_v1.jsonnet).
 All TFJob specs should be placed in the same directory.

-These are similar to actual TFJob specs. Note that many of these are using the
+These are similar to actual TFJob specs. Note that many of these are using the
 [training-operator-test-server](https://github.com/kubeflow/training-operator/tree/master/test/test-server) as the test image.
 This gives us more control over when each replica exits, and allows us to send specific requests like fetching the
 runtime TensorFlow config.
@@ -64,19 +66,23 @@ New test classes should be added as Argo workflow steps to the
 [workflows.libsonnet](https://github.com/kubeflow/training-operator/blob/master/test/workflows/components/workflows.libsonnet) file.

 Under the templates section, add the following to the dag:
+
 ```
 {
 name: "my-test",
 template: "my-test",
 dependencies: ["setup-kubeflow"],
 },
 ```
+
 This will configure Argo to run `my-test` after setting up the Kubeflow cluster.

 Next, add the following lines toward the end of the file:
+
 ```
 $.parts(namespace, name, overrides).e2e(prow_env, bucket).buildTestTemplate(
 "my-test"),
 ```
+
 This assumes that there is a corresponding Python file named `my_test.py` (note the difference between dashes and
 underscores).
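
Review note: the dash/underscore convention at the end of this file is easy to get wrong when adding a step. The sketch below shows the mapping from an Argo step name to the Python file name the doc expects. The helper name `module_filename_for_step` is hypothetical and does not exist in the repository; only the `my-test` to `my_test.py` pairing comes from the doc.

```python
# Hypothetical helper illustrating the naming convention described in
# docs/testing/e2e_testing.md: workflow step names use dashes, the
# matching Python test file uses underscores.


def module_filename_for_step(step_name: str) -> str:
    """Map an Argo step name such as 'my-test' to its Python file name."""
    return step_name.replace("-", "_") + ".py"


if __name__ == "__main__":
    print(module_filename_for_step("my-test"))
```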

scripts/setup-tf-operator.sh

+2 -2
@@ -30,11 +30,11 @@ GO_DIR=${GOPATH}/src/github.com/${REPO_OWNER}/${REPO_NAME}
 echo "Configuring kubeconfig.."
 aws eks update-kubeconfig --region=${REGION} --name=${CLUSTER_NAME}

-echo "Update tf operator manifest with new name $REGISTRY and tag $VERSION"
+echo "Update Training Operator manifest with new name $REGISTRY and tag $VERSION"
 cd manifests/overlays/standalone
 kustomize edit set image public.ecr.aws/j1r0q0g6/training/training-operator=${REGISTRY}:${VERSION}

-echo "Installing tf operator manifests"
+echo "Installing Training Operator manifests"
 kustomize build . | kubectl apply -f -

 TIMEOUT=30
