
SSL error when cleaning up pods #170


Closed
giordyb opened this issue Aug 13, 2019 · 12 comments

@giordyb

giordyb commented Aug 13, 2019

Hi,
I'm having a small issue running dask-kubernetes on a local Kubernetes 1.14.3 cluster (the one provided by the latest Docker Desktop): the job runs fine and I get back the correct results, but it looks like there is an SSL issue when it tries to clean up the pods (in _cleanup_pods, dask_kubernetes/core.py line 544).
This is the error that I get when I run my script:

root@69feaef9d947:/app# python process_receipts_dask2_kubernetes.py --testrun
/usr/local/lib/python3.7/site-packages/distributed/client.py:2: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import defaultdict, Iterator
/usr/local/lib/python3.7/site-packages/distributed/publish.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import MutableMapping
/usr/local/lib/python3.7/site-packages/distributed/scheduler.py:2: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import defaultdict, deque, OrderedDict, Mapping, Set
dasboard: {'dashboard': 8787}
                               file           data_gt  ... partita_iva_ocr partita_iva_match
0  MV_trfktqxxci_20180830070237.jpg               nan  ...     xxxxxxxxxxx              True
0  MV_tentyvsblx_20180903070230.jpg  29/08/2018 12.08  ...            None             False
0  MV_qrmiwvlyio_20180908100205.jpg               nan  ...            None             False
0  MV_jxavynnpmd_20180903070227.jpg               nan  ...            None             False
0  MV_dpaztptfio_20180908100214.jpg        06/09/2018  ...            None             False

[5 rows x 10 columns]
{'data': {'n_tot': 5, 'n_gt': 2, 'n_true_pos': 2, 'n_false_pos': 0, 'n_true_neg': 3, 'n_false_neg': 0, 'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0}, 'importo': {'n_tot': 5, 'n_gt': 5, 'n_true_pos': 1, 'n_false_pos': 0, 'n_true_neg': 0, 'n_false_neg': 4, 'precision': 1.0, 'recall': 0.2, 'f1_score': 0.33333333333333337}, 'partita_iva': {'n_tot': 5, 'n_gt': 4, 'n_true_pos': 1, 'n_false_pos': 0, 'n_true_neg': 1, 'n_false_neg': 3, 'precision': 1.0, 'recall': 0.25, 'f1_score': 0.4}}
2019-08-13 14:43:28,926 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(FileNotFoundError(2, 'No such file or directory'))': /api/v1/namespaces/default/pods?labelSelector=dask.org%2Fcluster-name%3Dreceipts%2Cuser%3Droot%2Capp%3Ddask%2Ccomponent%3Ddask-worker
2019-08-13 14:43:28,926 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(FileNotFoundError(2, 'No such file or directory'))': /api/v1/namespaces/default/pods?labelSelector=dask.org%2Fcluster-name%3Dreceipts%2Cuser%3Droot%2Capp%3Ddask%2Ccomponent%3Ddask-worker
2019-08-13 14:43:28,932 WARNING Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(FileNotFoundError(2, 'No such file or directory'))': /api/v1/namespaces/default/pods?labelSelector=dask.org%2Fcluster-name%3Dreceipts%2Cuser%3Droot%2Capp%3Ddask%2Ccomponent%3Ddask-worker
2019-08-13 14:43:28,932 WARNING Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(FileNotFoundError(2, 'No such file or directory'))': /api/v1/namespaces/default/pods?labelSelector=dask.org%2Fcluster-name%3Dreceipts%2Cuser%3Droot%2Capp%3Ddask%2Ccomponent%3Ddask-worker
2019-08-13 14:43:28,937 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(FileNotFoundError(2, 'No such file or directory'))': /api/v1/namespaces/default/pods?labelSelector=dask.org%2Fcluster-name%3Dreceipts%2Cuser%3Droot%2Capp%3Ddask%2Ccomponent%3Ddask-worker
2019-08-13 14:43:28,937 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(FileNotFoundError(2, 'No such file or directory'))': /api/v1/namespaces/default/pods?labelSelector=dask.org%2Fcluster-name%3Dreceipts%2Cuser%3Droot%2Capp%3Ddask%2Ccomponent%3Ddask-worker
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 322, in ssl_wrap_socket
    context.load_verify_locations(ca_certs, ca_cert_dir)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 603, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
    ssl_context=context)
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 324, in ssl_wrap_socket
    raise SSLError(e)
urllib3.exceptions.SSLError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/weakref.py", line 648, in _exitfunc
    f()
  File "/usr/local/lib/python3.7/weakref.py", line 572, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/usr/local/lib/python3.7/site-packages/dask_kubernetes/core.py", line 544, in _cleanup_pods
    pods = api.list_namespaced_pod(namespace, label_selector=format_labels(labels))
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 12372, in list_namespaced_pod
    (data) = self.list_namespaced_pod_with_http_info(namespace, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 12472, in list_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 355, in request
    headers=headers)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 205, in request
    headers=headers)
  File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 68, in request
    **urlopen_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 89, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 326, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    **response_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    **response_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    **response_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 641, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='kubernetes.docker.internal', port=6443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=dask.org%2Fcluster-name%3Dreceipts%2Cuser%3Droot%2Capp%3Ddask%2Ccomponent%3Ddask-worker (Caused by SSLError(FileNotFoundError(2, 'No such file or directory')))

The code is running inside a container that uses the same image as the one specified in worker-spec.yml:

kind: Pod
spec:
  restartPolicy: Never
  
  containers:
  - image: sroie_app
    imagePullPolicy: IfNotPresent
    command: [/usr/local/bin/python]
    #args: [/usr/local/bin/dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 1GB, --death-timeout, '60']
    args: [/usr/local/bin/dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 1GB, --death-timeout, '3600']
    name: dask
    resources:
      limits:
        cpu: "1"
        memory: 1G
      requests:
        cpu: "1"
        memory: 512M
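
(For context, the cluster itself is created from that spec in the usual way, roughly like this; the worker count here is just an example:)

from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml("worker-spec.yml")
cluster.scale(5)  # example worker count
client = Client(cluster)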

It seems related to #113, but in my case the container is running inside the k8s cluster and I can reach all of the workers correctly.

Thanks,

Giordano

@TomAugspurger
Member

cc @jacobtomlinson if you have any thoughts.

@giordyb
Author

giordyb commented Aug 13, 2019

I forgot to mention that if I tail the api-server log I get these errors whenever I run the script:

I0813 15:05:43.225158       1 log.go:172] http: TLS handshake error from 172.22.0.3:37774: EOF
I0813 15:05:43.226071       1 log.go:172] http: TLS handshake error from 172.22.0.3:37766: EOF
I0813 15:05:43.226750       1 log.go:172] http: TLS handshake error from 172.22.0.3:37772: EOF
I0813 15:05:43.226947       1 log.go:172] http: TLS handshake error from 172.22.0.3:37770: EOF

@jacobtomlinson
Member

Hmm. My initial thought would be: are the CA cert bundles installed correctly on the machine you are running the script from?
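
A quick sanity check from inside the container would be something like this (just a sketch using the standard-library defaults):

import os
import ssl

paths = ssl.get_default_verify_paths()
print("cafile:", paths.cafile, os.path.exists(paths.cafile or ""))
print("capath:", paths.capath, os.path.exists(paths.capath or ""))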

@giordyb
Author

giordyb commented Aug 13, 2019

Hi there,

thanks for the quick reply!

This is my Dockerfile:


FROM python:3.7.4-buster
RUN apt-get update
RUN apt-get install apt-transport-https apt-utils -y
RUN echo "deb https://notesalexp.org/tesseract-ocr/buster/ buster main" >> etc/apt/sources.list
#RUN echo "deb https://notesalexp.org/tesseract-ocr/tessdata_best/ buster main" >> etc/apt/sources.list
#RUN wget -O - https://notesalexp.org/debian/alexp_key.asc | sudo apt-key add -
RUN apt-get update -oAcquire::AllowInsecureRepositories=true
RUN apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true -y --allow-unauthenticated
RUN apt-get update
RUN apt-get install ghostscript imagemagick tesseract-ocr tesseract-ocr-ita tesseract-ocr-fra poppler-utils -y
RUN pip install --upgrade pip
COPY requirements.txt /
RUN pip install -r requirements.txt
RUN mkdir /app
RUN mkdir /root/.kube/
WORKDIR /app
COPY . /app
ENV REQUESTS_CA_BUNDLE=/etc/ssl/certs/
RUN curl -o /usr/local/bin/kubectl -LO https://storage.googleapis.com/kubernetes-release/release/`curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl 
RUN chmod a+x /usr/local/bin/kubectl
COPY ca.crt /usr/local/share/ca-certificates
RUN update-ca-certificates
COPY config /root/.kube/
ENTRYPOINT [ "bash" ]

If I look in /etc/ssl/certs I see all of the certificates. As you can see from the Dockerfile, I also tried to add the cluster's CA to the bundle, but I am still getting the same error.
Also, if it were an issue with the certificates, shouldn't I get an error when the worker containers are created?

@giordyb
Author

giordyb commented Aug 13, 2019

I've done some more troubleshooting:

If I extract the certificates from my kubectl config and use the KubeAuth object, then it works without a hitch (and the pods get cleaned up correctly):

from dask_kubernetes import KubeAuth, KubeCluster

kauth = KubeAuth(
    host="https://kubernetes.docker.internal:6443",
    verify_ssl=True,
    ssl_ca_cert="k8sca.crt",
    cert_file="k8scert.crt",
    key_file="k8skey.crt",
)
cluster = KubeCluster.from_yaml("worker-spec.yml", auth=[kauth])
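
(For reference, I pulled the cert/key data out of the kubeconfig roughly like this; this assumes the Docker Desktop config embeds them as base64 data, and the output file names are just the ones I happened to use:)

import base64
import yaml

with open("/root/.kube/config") as f:
    cfg = yaml.safe_load(f)

files = {
    "k8sca.crt": cfg["clusters"][0]["cluster"]["certificate-authority-data"],
    "k8scert.crt": cfg["users"][0]["user"]["client-certificate-data"],
    "k8skey.crt": cfg["users"][0]["user"]["client-key-data"],
}
for name, data in files.items():
    with open(name, "wb") as out:
        out.write(base64.b64decode(data))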

If I use the kubectl config file instead, it throws the SSL error at the end and the pods do not get cleaned up:

from dask_kubernetes import KubeCluster, KubeConfig

kconfig = KubeConfig(
    config_file="/root/.kube/config", context=None, persist_config=True
)
cluster = KubeCluster.from_yaml("worker-spec.yml", auth=[kconfig])

Shouldn't it behave the same way?

@giordyb
Author

giordyb commented Aug 13, 2019

I think I might have found the issue:

the _cleanup_pods function does not set up the authentication the same way the __init__ function does:

def _cleanup_pods(namespace, labels):
    """ Remove all pods with these labels in this namespace """
    api = kubernetes.client.CoreV1Api()
    pods = api.list_namespaced_pod(namespace, label_selector=format_labels(labels))
    for pod in pods.items:
        try:
            api.delete_namespaced_pod(pod.metadata.name, namespace)
            logger.info("Deleted pod: %s", pod.metadata.name)
        except kubernetes.client.rest.ApiException as e:
            # ignore error if pod is already removed
            if e.status != 404:
                raise

It looks like it's missing the call to ClusterAuth.load_first() that __init__ makes before creating the client:

ClusterAuth.load_first(auth)
self.core_api = kubernetes.client.CoreV1Api()

I tried adding a ClusterAuth.load_first() call before kubernetes.client.CoreV1Api() in my environment and now it works, but I am not sure it's the correct fix, because to be consistent with the rest of the code it would also need to be passed the "auth" variable.
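Roughly, what I have in mind is something like this (untested sketch; the auth default here is just a guess at how it could be wired up):

def _cleanup_pods(namespace, labels, auth=ClusterAuth.DEFAULT):
    """ Remove all pods with these labels in this namespace """
    # set up auth the same way __init__ does before creating the API client
    ClusterAuth.load_first(auth)
    api = kubernetes.client.CoreV1Api()
    pods = api.list_namespaced_pod(namespace, label_selector=format_labels(labels))
    # ... rest of the function unchanged ...
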
Does anybody have any suggestions?

@jacobtomlinson
Member

Thanks for debugging this. Yeah, it looks like some of the auth flow is missing there. It might be more sensible to run ClusterAuth.load_first(ClusterAuth.DEFAULT) for now, and I'll take a bigger look at this as part of #162.

Do you feel up to putting in a PR for this?

@giordyb
Author

giordyb commented Aug 15, 2019

Hi @jacobtomlinson, I just submitted the pull request. A suggestion, since you are going to rewrite this: I think it would be useful to make the pod deletion optional, since for debugging purposes it helps to be able to look at a pod's output.

@jacobtomlinson
Member

You can already do this with cluster.logs().
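
For example, roughly:

from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml("worker-spec.yml")
# ... run the computation ...
print(cluster.logs())  # grab the worker pod logs before the cluster is torn down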

@giordyb
Author

giordyb commented Aug 16, 2019

@jacobtomlinson thanks, it worked like a charm, I guess I should have rtfm 😃

@csala

csala commented May 3, 2020

I'm hitting this error again when running with the following versions:

dask-kubernetes==0.10.1
kubernetes==11.0.0
kubernetes-asyncio==11.2.0

Applying a change similar to what @giordyb suggested on #172 makes the problem go away:

  1. Pass self.core_api down to self._cleanup_resources when calling finalize instead of creating a new CoreV1Api instance.
  2. Add a yield statement to the calls that get the pods and services lists:
def _cleanup_resources(namespace, labels, core_api):
    """ Remove all pods with these labels in this namespace """

    pods = yield core_api.list_namespaced_pod(namespace, label_selector=format_labels(labels))
    ...

    services = yield core_api.list_namespaced_service(
        namespace, label_selector=format_labels(labels)
    )

I'll happily open a new Issue or a PR if requested.

@jacobtomlinson
Member

I'll happily open a new Issue or a PR if requested.

A PR would be great thanks!
