README.md (+1 -1)
@@ -17,7 +17,7 @@ Take a look at the [concepts](https://jobset.sigs.k8s.io/docs/concepts/) page fo
- **Support for multi-template jobs**: JobSet models a distributed training workload as a group of K8s Jobs. This allows a user to easily specify different pod templates for distinct groups of pods (e.g. a leader, workers, parameter servers, etc.), something which cannot be done by a single Job.
-
- **Automatic headless service configuration and lifecycle management**: ML and HPC frameworks require a stable network endpoint for each worker in the distributed workload. Since pod IPs are dynamically assigned and can change between restarts, stable pod hostnames are required for distributed training on k8s. By default, JobSet uses [IndexedJobs](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs/) to establish stable pod hostnames, and automatically configures and manages the lifecycle of the headless service to trigger DNS record creation and establish network connectivity via pod hostnames.
+
- **Automatic headless service configuration and lifecycle management**: ML and HPC frameworks require a stable network endpoint for each worker in the distributed workload. Since pod IPs are dynamically assigned and can change between restarts, stable pod hostnames are required for distributed training on k8s. By default, JobSet uses [IndexedJobs](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs/) to establish stable pod hostnames, and automatically configures and manages the lifecycle of the headless service to trigger DNS record creation and establish network connectivity via pod hostnames. These networking settings are defaulted automatically to enable stable network endpoints and pod-to-pod communication via hostnames; however, they can be customized in the JobSet spec: see this [example](examples/simple/jobset-with-network.yaml) of using a custom subdomain in your JobSet's network configuration.
- **Configurable success policies**: JobSet has [configurable success policies](https://github.com/kubernetes-sigs/jobset/blob/v0.5.0/examples/simple/success-policy.yaml) which target specific ReplicatedJobs, with operators to target `Any` or `All` of their child jobs. For example, you can configure the JobSet to be marked complete if and only if all pods that are part of the “worker” ReplicatedJob are completed. This enables users to use their compute resources more efficiently, allowing a workload to be declared successful and release the resources for the next workload more quickly.
site/content/en/docs/troubleshooting/_index.md (+44)
@@ -59,3 +59,47 @@ Look at the JobSet controller logs and you'll probably see an error like this:
**Cause**: This could be due to a known bug in an older version of JobSet, or a known bug in an older version of Kueue. JobSet and Kueue integration requires JobSet v0.2.3+ and Kueue v0.4.1+.
**Solution**: If you're using a JobSet version lower than v0.2.3, uninstall and re-install using a version >= v0.2.3 (see the JobSet [installation guide](https://jobset.sigs.k8s.io/docs/installation/) for the commands to do this). If you're using a Kueue version lower than v0.4.1, uninstall and re-install using a version >= v0.4.1 (see the Kueue [installation guide](https://kueue.sigs.k8s.io/docs/installation/) for the commands to do this).
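Before reinstalling, it can help to confirm which versions are actually deployed. The following is a minimal sketch, not an official tool: `version_ge` is a hypothetical helper that compares two version strings using `sort -V`, and the namespace and deployment names in the commented usage (`jobset-system`, `jobset-controller-manager`) are assumptions about a default installation that may differ in your cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when version $1 >= version $2.
# Relies on `sort -V` (version sort); a leading "v" is stripped first.
version_ge() {
  local have="${1#v}" want="${2#v}"
  # If the smaller of the two versions is the required one, we meet the minimum.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

# Example usage (requires cluster access; names below are assumptions):
#   image=$(kubectl get deployment jobset-controller-manager -n jobset-system \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   version_ge "${image##*:}" v0.2.3 || echo "JobSet is older than v0.2.3"
```

The same check can be repeated for Kueue against v0.4.1, substituting the deployment and namespace used by your Kueue installation.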
## 4. Troubleshooting network communication between different Pods
**Cause**: Network communication between Pods might be blocked by a network policy, or disrupted by an unstable cluster environment.
**Solution**: Follow these debugging steps. First, deploy the [example](../../../../../site/static/examples/simple/jobset-with-network.yaml) by running `kubectl apply -f jobset-network.yaml`, then check that the pods and services of the JobSet are running correctly. Next, use the `kubectl exec` command to enter a container and inspect its `/etc/hosts` file, where you should see a domain name such as `network-jobset-leader-0-0.example`. Other containers can use this domain name to reach the current pod, and you can likewise use the domain names of other pods for network communication.
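The checks above can be sketched as a small script. This is an illustrative outline under stated assumptions, not part of JobSet: `pod_hostname` and `debug_jobset_network` are hypothetical helpers, the hostname pattern is inferred from the `network-jobset-leader-0-0.example` name above, the `jobset.sigs.k8s.io/jobset-name` label selector is assumed to be present on the pods, and `getent` is assumed to exist in the container image.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the hostname a JobSet pod is expected to have,
# assuming the pattern <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>.
pod_hostname() {
  printf '%s-%s-%s-%s.%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical helper: run the debugging steps against a live cluster.
# Not invoked automatically, because it needs kubectl access.
#   $1 = JobSet name, $2 = name of a running pod, $3 = peer hostname to resolve
debug_jobset_network() {
  local jobset="$1" pod="$2" peer="$3"
  # 1. Confirm the JobSet's pods and services exist and look healthy.
  kubectl get pods -l "jobset.sigs.k8s.io/jobset-name=${jobset}"
  kubectl get services
  # 2. Inspect /etc/hosts inside the pod for its own hostname entry.
  kubectl exec "$pod" -- cat /etc/hosts
  # 3. Check that the peer's hostname resolves from inside the pod.
  kubectl exec "$pod" -- getent hosts "$peer"
}
```

For the example JobSet, `pod_hostname network-jobset leader 0 0 example` yields the name to look for in `/etc/hosts`; if the name resolves but traffic still fails, a network policy is the more likely cause.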
```bash
root@VM-0-4-ubuntu:/home/ubuntu# vi jobset-network.yaml
```