Commit 7866d64

docs: add simple example for network field (kubernetes-sigs#550)
* add simple example for network field
* add Usage example for Demo
* fix: readme desc
* fix features overview docs desc
* add space for comments

Signed-off-by: googs1025 <[email protected]>

* add troubleshooting docs

Signed-off-by: googs1025 <[email protected]>

* add ping in container command
* fix yaml jobset worker command
* fix: use leader dns
* fix dns hostname

---------

Signed-off-by: googs1025 <[email protected]>
1 parent cc4890b commit 7866d64

5 files changed: +171 -5 lines changed

README.md

+1-1
@@ -17,7 +17,7 @@ Take a look at the [concepts](https://jobset.sigs.k8s.io/docs/concepts/) page fo
 - **Support for multi-template jobs**: JobSet models a distributed training workload as a group of K8s Jobs. This allows a user to easily specify different pod templates for different distinct groups of pods (e.g. a leader, workers, parameter servers, etc.), something which cannot be done by a single Job.

-- **Automatic headless service configuration and lifecycle management**: ML and HPC frameworks require a stable network endpoint for each worker in the distributed workload, and since pod IPs are dynamically assigned and can change between restarts, stable pod hostnames are required for distributed training on k8s, By default, JobSet uses [IndexedJobs](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs/) to establish stable pod hostnames, and does automatic configuration and lifecycle management of the headless service to trigger DNS record creations and establish network connectivity via pod hostnames.
+- **Automatic headless service configuration and lifecycle management**: ML and HPC frameworks require a stable network endpoint for each worker in the distributed workload, and since pod IPs are dynamically assigned and can change between restarts, stable pod hostnames are required for distributed training on k8s. By default, JobSet uses [IndexedJobs](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs/) to establish stable pod hostnames, and does automatic configuration and lifecycle management of the headless service to trigger DNS record creations and establish network connectivity via pod hostnames. These networking configurations are defaulted automatically to enable stable network endpoints and pod-to-pod communication via hostnames; however, they can be customized in the JobSet spec: see this [example](examples/simple/jobset-with-network.yaml) of using a custom subdomain in your JobSet's network configuration.

 - **Configurable success policies**: JobSet has [configurable success policies](https://github.com/kubernetes-sigs/jobset/blob/v0.5.0/examples/simple/success-policy.yaml) which target specific ReplicatedJobs, with operators to target `Any` or `All` of their child jobs. For example, you can configure the JobSet to be marked complete if and only if all pods that are part of the “worker” ReplicatedJob are completed. This enables users to use their compute resources more efficiently, allowing a workload to be declared successful and release the resources for the next workload more quickly.

RELEASE.md

+4-4
@@ -3,7 +3,7 @@
 The Kubernetes Template Project is released on an as-needed basis. The process is as follows:

 1. An issue is proposing a new release with a changelog since the last release
-1. All [OWNERS](OWNERS) must LGTM this release
-1. An OWNER runs `git tag -s $VERSION` and inserts the changelog and pushes the tag with `git push $VERSION`
-1. The release issue is closed
-1. An announcement email is sent to `[email protected]` with the subject `[ANNOUNCE] kubernetes-template-project $VERSION is released`
+2. All [OWNERS](OWNERS) must LGTM this release
+3. An OWNER runs `git tag -s $VERSION` and inserts the changelog and pushes the tag with `git push $VERSION`
+4. The release issue is closed
+5. An announcement email is sent to `[email protected]` with the subject `[ANNOUNCE] kubernetes-template-project $VERSION is released`
+61
@@ -0,0 +1,61 @@
+apiVersion: jobset.x-k8s.io/v1alpha2
+kind: JobSet
+metadata:
+  name: network-jobset
+spec:
+  network:
+    # this field allows pods to be reached via their hostnames
+    # hostname: <jobSet.name>-<spec.replicatedJob.name>-<job-index>-<pod-index>.<subdomain>
+    # example: network-jobset-leader-0-0.example
+    enableDNSHostnames: true
+    # subdomain is a field for a network subdomain name
+    # defaults to <jobSet.name> if not set.
+    subdomain: example
+    # this field indicates if DNS records of pods should be published before the pods are ready.
+    # default to true
+    publishNotReadyAddresses: true
+  replicatedJobs:
+  - name: leader
+    replicas: 1
+    template:
+      spec:
+        backoffLimit: 0
+        completions: 1
+        parallelism: 1
+        template:
+          spec:
+            containers:
+            - name: leader
+              image: bash:latest
+              command:
+              - bash
+              - -xc
+              - |
+                sleep 3600
+  - name: workers
+    replicas: 1
+    template:
+      spec:
+        backoffLimit: 0
+        completions: 2
+        parallelism: 2
+        template:
+          spec:
+            containers:
+            - name: worker
+              image: bash:latest
+              command:
+              - bash
+              - -xc
+              - |
+                sleep 20
+                success_count=0
+                for i in {1..2}; do
+                  if ping -c 1 network-jobset-leader-0-0.example; then
+                    ((success_count++))
+                  fi
+                done
+                if [[ $success_count -eq 2 ]]; then
+                  echo "leader is up"
+                fi
+                while true; do sleep 3600; done
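
The hostname pattern described in the manifest comments can be sketched as a small shell helper. This is only an illustration: `pod_hostname` is a hypothetical function name, not part of JobSet or kubectl, and the fallback of the subdomain to the JobSet name simply mirrors the comment in the manifest.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of JobSet): build a pod hostname following
#   <jobSet.name>-<replicatedJob.name>-<job-index>-<pod-index>.<subdomain>
pod_hostname() {
  local jobset="$1" rjob="$2" job_index="$3" pod_index="$4"
  # Per the manifest comment, the subdomain defaults to the JobSet name.
  local subdomain="${5:-$jobset}"
  printf '%s-%s-%s-%s.%s\n' "$jobset" "$rjob" "$job_index" "$pod_index" "$subdomain"
}

# Hostnames for the example manifest: one leader pod and two worker pods.
pod_hostname network-jobset leader 0 0 example
for pod_index in 0 1; do
  pod_hostname network-jobset workers 0 "$pod_index" example
done
```

For the manifest above, this prints `network-jobset-leader-0-0.example` followed by the two `network-jobset-workers-0-<pod-index>.example` names, which are the names the worker container pings.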

site/content/en/docs/troubleshooting/_index.md

+44
@@ -59,3 +59,47 @@ Look at the JobSet controller logs and you'll probably see an error like this:
 **Cause**: This could be due to a known bug in an older version of JobSet, or a known bug in an older version of Kueue. JobSet and Kueue integration requires JobSet v0.2.3+ and Kueue v0.4.1+.

 **Solution**: If you're using a JobSet version less than v0.2.3, uninstall and re-install using a version >= v0.2.3 (see the JobSet [installation guide](https://jobset.sigs.k8s.io/docs/installation/) for the commands to do this). If you're using a Kueue version less than v0.4.1, uninstall and re-install using a version >= v0.4.1 (see the Kueue [installation guide](https://kueue.sigs.k8s.io/docs/installation/) for the commands to do this).
+
+## 4. Troubleshooting network communication between different Pods
+
+**Cause**: Network communication between Pods might be blocked by a network policy or disrupted by an unstable cluster environment.
+
+**Solution**: Follow these debugging steps. First, deploy the [example](../../../../../site/static/examples/simple/jobset-with-network.yaml) by running `kubectl apply -f jobset-network.yaml`, then check that the pods and services of the JobSet are running correctly. Next, use `kubectl exec` to enter a container and inspect its `/etc/hosts` file: it contains a domain name such as `network-jobset-leader-0-0.example`, which other containers can use to reach the current pod. Likewise, you can use the domain names of other pods to communicate with them.
+```bash
+root@VM-0-4-ubuntu:/home/ubuntu# vi jobset-network.yaml
+root@VM-0-4-ubuntu:/home/ubuntu# kubectl apply -f jobset-network.yaml
+jobset.jobset.x-k8s.io/network-jobset created
+root@VM-0-4-ubuntu:/home/ubuntu# kubectl get pods -o wide
+NAME                               READY   STATUS    RESTARTS   AGE   IP          NODE               NOMINATED NODE   READINESS GATES
+network-jobset-leader-0-0-5xnzz    1/1     Running   0          17m   10.6.2.27   cluster1-worker    <none>           <none>
+network-jobset-workers-0-0-78k9j   1/1     Running   0          17m   10.6.1.16   cluster1-worker2   <none>           <none>
+network-jobset-workers-0-1-rmw42   1/1     Running   0          17m   10.6.2.28   cluster1-worker    <none>           <none>
+root@VM-0-4-ubuntu:/home/ubuntu# kubectl get svc
+NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
+example      ClusterIP   None         <none>        <none>    19s
+kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   2d1h
+```
+
+```bash
+root@VM-0-4-ubuntu:/home/ubuntu# kubectl exec -it network-jobset-leader-0-0-5xnzz -- sh
+/ # cat /etc/hosts
+# Kubernetes-managed hosts file.
+127.0.0.1 localhost
+...
+10.6.2.27 network-jobset-leader-0-0.example.default.svc.cluster.local network-jobset-leader-0-0
+/ # ping network-jobset-workers-0-0.example
+PING network-jobset-workers-0-0.example (10.6.1.16): 56 data bytes
+64 bytes from 10.6.1.16: seq=0 ttl=62 time=0.121 ms
+64 bytes from 10.6.1.16: seq=1 ttl=62 time=0.093 ms
+64 bytes from 10.6.1.16: seq=2 ttl=62 time=0.094 ms
+64 bytes from 10.6.1.16: seq=3 ttl=62 time=0.103 ms
+--- network-jobset-workers-0-0.example ping statistics ---
+4 packets transmitted, 4 packets received, 0% packet loss
+round-trip min/avg/max = 0.093/0.102/0.121 ms
+/ # ping network-jobset-workers-0-1.example
+PING network-jobset-workers-0-1.example (10.6.2.28): 56 data bytes
+64 bytes from 10.6.2.28: seq=0 ttl=63 time=0.068 ms
+64 bytes from 10.6.2.28: seq=1 ttl=63 time=0.072 ms
+64 bytes from 10.6.2.28: seq=2 ttl=63 time=0.079 ms
+--- network-jobset-workers-0-1.example ping statistics ---
+```
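
The manual ping checks in the transcript could be scripted along these lines. This is a minimal sketch, not part of the JobSet docs: `check_peers` and the `KUBECTL` override variable are hypothetical names introduced here, and it assumes the container image provides `ping`, as the `bash` image in the example does.

```shell
#!/usr/bin/env bash
# Hypothetical override so the check can be dry-run without a cluster;
# defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# check_peers POD HOST...
# Pings each HOST once from inside POD and reports the result per host.
# Returns non-zero if any host could not be reached.
check_peers() {
  local pod="$1"; shift
  local failed=0 host
  for host in "$@"; do
    if "$KUBECTL" exec "$pod" -- ping -c 1 "$host" >/dev/null 2>&1; then
      echo "ok   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}
```

With the pod names from the transcript, a call might look like `check_peers network-jobset-leader-0-0-5xnzz network-jobset-workers-0-0.example network-jobset-workers-0-1.example`. A `FAIL` line only shows that the ping did not succeed; it does not distinguish a DNS failure from a blocked connection.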
@@ -0,0 +1,61 @@
+apiVersion: jobset.x-k8s.io/v1alpha2
+kind: JobSet
+metadata:
+  name: network-jobset
+spec:
+  network:
+    # this field allows pods to be reached via their hostnames
+    # hostname: <jobSet.name>-<spec.replicatedJob.name>-<job-index>-<pod-index>.<subdomain>
+    # example: network-jobset-leader-0-0.example
+    enableDNSHostnames: true
+    # subdomain is a field for a network subdomain name
+    # defaults to <jobSet.name> if not set.
+    subdomain: example
+    # this field indicates if DNS records of pods should be published before the pods are ready.
+    # default to true
+    publishNotReadyAddresses: true
+  replicatedJobs:
+  - name: leader
+    replicas: 1
+    template:
+      spec:
+        backoffLimit: 0
+        completions: 1
+        parallelism: 1
+        template:
+          spec:
+            containers:
+            - name: leader
+              image: bash:latest
+              command:
+              - bash
+              - -xc
+              - |
+                sleep 3600
+  - name: workers
+    replicas: 1
+    template:
+      spec:
+        backoffLimit: 0
+        completions: 2
+        parallelism: 2
+        template:
+          spec:
+            containers:
+            - name: worker
+              image: bash:latest
+              command:
+              - bash
+              - -xc
+              - |
+                sleep 20
+                success_count=0
+                for i in {1..2}; do
+                  if ping -c 1 network-jobset-leader-0-0.example; then
+                    ((success_count++))
+                  fi
+                done
+                if [[ $success_count -eq 2 ]]; then
+                  echo "leader is up"
+                fi
+                while true; do sleep 3600; done
