ci(aks): Training-operator UAT fails on AKS k8s 1.28 #894
Comments
Thank you for reporting your feedback! The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5650.
Did some exploration here and these are the logs when running the same thing locally.

Describe worker pod
╰─$ kdp -n test-kubeflow pytorch-dist-mnist-gloo-worker-0
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m44s default-scheduler Successfully assigned test-kubeflow/pytorch-dist-mnist-gloo-worker-0 to aks-nodepool1-16255669-vmss000000
Normal Pulling 4m43s kubelet Pulling image "alpine:3.10"
Normal Pulled 4m41s kubelet Successfully pulled image "alpine:3.10" in 2.24s (2.24s including waiting)
Normal Created 4m41s kubelet Created container init-pytorch
Normal Started 4m41s kubelet Started container init-pytorch
Normal Pulling 3m38s kubelet Pulling image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0"
Normal Pulled 2m55s kubelet Successfully pulled image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0" in 42.742s (42.742s including waiting)
Normal Created 2m1s (x4 over 2m55s) kubelet Created container pytorch
Normal Started 2m1s (x4 over 2m55s) kubelet Started container pytorch
Warning BackOff 79s (x7 over 2m39s) kubelet Back-off restarting failed container pytorch in pod pytorch-dist-mnist-gloo-worker-0_test-kubeflow(ff109a33-1fa8-47c4-be31-3aff036c64b9)
Normal Pulled 68s (x4 over 2m41s) kubelet Container image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0" already present on machine

Describe master pod
╰─$ kdp -n test-kubeflow pytorch-dist-mnist-gloo-master-0
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m58s default-scheduler Successfully assigned test-kubeflow/pytorch-dist-mnist-gloo-master-0 to aks-nodepool1-16255669-vmss000001
Normal Pulling 6m57s kubelet Pulling image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0"
Normal Pulled 6m14s kubelet Successfully pulled image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0" in 43.869s (43.869s including waiting)
Normal Created 3m20s (x5 over 6m14s) kubelet Created container pytorch
Normal Pulled 3m20s (x4 over 5m7s) kubelet Container image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0" already present on machine
Normal Started 3m19s (x5 over 6m13s) kubelet Started container pytorch
Warning BackOff 84s (x10 over 4m53s) kubelet Back-off restarting failed container pytorch in pod pytorch-dist-mnist-gloo-master-0_test-kubeflow(6ca00177-a5f1-4db2-83f1-344176c40481)

Logs from worker pod
╰─$ kl -n test-kubeflow pytorch-dist-mnist-gloo-worker-0
Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Traceback (most recent call last):
File "/var/mnist.py", line 150, in <module>
main()
File "/var/mnist.py", line 123, in main
transforms.Normalize((0.1307,), (0.3081,))
File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 46, in __init__
epoch, batch_idx * len(data), len(train_loader.dataset),
File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 114, in download
if should_distribute():
File "/opt/conda/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/opt/conda/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/opt/conda/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/opt/conda/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/opt/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/opt/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
At a different point in time:
╰─$ kl -n test-kubeflow pytorch-dist-mnist-gloo-worker-0
Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
Using distributed PyTorch with gloo backend
Traceback (most recent call last):
File "/var/mnist.py", line 150, in <module>
main()
File "/var/mnist.py", line 116, in main
dist.init_process_group(backend=args.backend)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, start_daemon)
ValueError: host not found: Name or service not known
# from init container
# should be totally irrelevant since it succeeds in the end
╰─$ kl -n test-kubeflow pytorch-dist-mnist-gloo-worker-0 -c init-pytorch
nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'pytorch-dist-mnist-gloo-master-0': Name does not resolve
waiting for master
nslookup: can't resolve '(null)': Name does not resolve
...
nslookup: can't resolve 'pytorch-dist-mnist-gloo-master-0': Name does not resolve
waiting for master
nslookup: can't resolve '(null)': Name does not resolve
Name: pytorch-dist-mnist-gloo-master-0
Address 1: 10.244.1.54 10-244-1-54.pytorch-dist-mnist-gloo-master-0.test-kubeflow.svc.cluster.local

Logs from master pod. I think the same message is just propagated from the worker.
╰─$ kl -n test-kubeflow pytorch-dist-mnist-gloo-master-0
Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Traceback (most recent call last):
File "/var/mnist.py", line 150, in <module>
main()
File "/var/mnist.py", line 123, in main
transforms.Normalize((0.1307,), (0.3081,))
File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 46, in __init__
epoch, batch_idx * len(data), len(train_loader.dataset),
File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 114, in download
if should_distribute():
File "/opt/conda/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/opt/conda/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/opt/conda/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/opt/conda/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/opt/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/opt/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
@orfeas-k looks like the example (from upstream?) is not working. The 403 is because the file it tries to download, http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz, returns 403, so we'll need to use a different example. Something similar had happened with Katib in the past: canonical/charmed-kubeflow-uats#64
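For anyone re-checking this, here is a minimal sketch (not the fix itself) that compares the example's hard-coded yann.lecun.com URL with the S3 mirror that newer torchvision releases use. The mirror URL is an assumption to verify, not a confirmed endpoint:

```python
# Sketch: compare the original MNIST URL (currently returning 403) with the
# ossci-datasets S3 mirror used by newer torchvision releases (assumed URL).
import urllib.error
import urllib.request

FILES = ["train-images-idx3-ubyte.gz", "train-labels-idx1-ubyte.gz"]
SOURCES = {
    "original": "http://yann.lecun.com/exdb/mnist/",
    "mirror": "https://ossci-datasets.s3.amazonaws.com/mnist/",
}

for source, base in SOURCES.items():
    for filename in FILES:
        url = base + filename
        request = urllib.request.Request(url, method="HEAD")  # avoid downloading the archives
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                print(f"{source}: {url} -> HTTP {response.status}")
        except urllib.error.HTTPError as err:
            print(f"{source}: {url} -> HTTP {err.code}")  # yann.lecun.com currently 403s
```

Any replacement example would either need to fetch the dataset from a working source like that mirror or ship the data baked into the test image so the job has no external download at all.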
Looks like the issue comes from using an out-of-date image. Upstream faced a similar problem (kubeflow/trainer#2083) and updated the image.
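For reference, the image is set in the container spec of the PyTorchJob's replica specs. Below is a rough sketch of what the updated job could look like, using the Kubernetes Python client; the image value is a hypothetical placeholder (the actual replacement is whatever upstream settled on in kubeflow/trainer#2083), and the spec is simplified rather than copied from the UAT:

```python
# Sketch: create a PyTorchJob pointing at an updated image.
# NEW_IMAGE is a placeholder, not a real image reference.
from kubernetes import client, config

NEW_IMAGE = "<updated-pytorch-dist-mnist-image>"  # hypothetical placeholder

def replica_spec(replicas):
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "pytorch",
                        # previously gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                        "image": NEW_IMAGE,
                        "args": ["--backend", "gloo"],
                    }
                ]
            }
        },
    }

pytorchjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "pytorch-dist-mnist-gloo", "namespace": "test-kubeflow"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": replica_spec(1),
            "Worker": replica_spec(1),
        }
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="test-kubeflow",
    plural="pytorchjobs",
    body=pytorchjob,
)
```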
Bug Description
Training-operator UAT starts failing after bumping the k8s version to 1.28 on AKS with AssertionError: Job pytorch-dist-mnist-gloo was not successful. This is the case both for CKF latest/edge and 1.8/stable. Unfortunately, we do not have more detailed logs due to a known limitation of how our UATs run (canonical/charmed-kubeflow-uats#4).

Example runs
To Reproduce
Run CI for k8s version 1.28
Environment
AKS k8s 1.28
Juju 3.1
juju status for 1.8
juju status for latest/edge
Relevant Log Output
Additional Context
No response