
microk8s cross node communication not working #3133

@RobinJespersen

Description

My service/pod is only reachable from the node it is running on.


my setup

I have three fresh and identical Ubuntu 20.04.4 LTS servers, each with its own public IP address.

I installed microk8s on all nodes by running:
sudo snap install microk8s --classic

On the master node I executed
microk8s add-node
and joined the two other servers by executing
microk8s join XXX.XXX.X.XXX:25000/92b2db237428470dc4fcfc4ebbd9dc81/2c0cb3284b05

After that, running kubectl get no shows all three nodes with the status Ready.
And kubectl get all --all-namespaces shows

NAMESPACE     NAME                                          READY   STATUS    RESTARTS      AGE
kube-system   pod/calico-node-hwsvj                         1/1     Running   1 (63m ago)   72m
kube-system   pod/calico-node-zd6rc                         1/1     Running   1 (62m ago)   71m
kube-system   pod/calico-node-djkmk                         1/1     Running   1 (62m ago)   72m
kube-system   pod/calico-kube-controllers-dc44f6cdf-flj54   1/1     Running   2 (62m ago)   74m

NAMESPACE   NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
default     service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   75m

NAMESPACE     NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/calico-node   3         3         3       3            3           kubernetes.io/os=linux   75m

NAMESPACE     NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/calico-kube-controllers   1/1     1            1           75m

NAMESPACE     NAME                                                DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/calico-kube-controllers-dc44f6cdf   1         1         1       74m

wget --no-check-certificate https://10.152.183.1/
executed on any of the nodes always returns

WARNING: cannot verify 10.152.183.1's certificate, issued by ‘CN=10.152.183.1’:
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 401 Unauthorized

Username/Password Authentication Failed.

So far everything works as expected.


problem 1

I get the IP of calico-kube-controllers by calling kubectl describe -n=kube-system pod/calico-kube-controllers-dc44f6cdf-flj54

And executing wget https://10.1.50.194/ on the "master" node returns

Connecting to 10.1.50.194:443... failed: Connection refused.

and on the two other nodes

Connecting to 10.1.50.194:80... failed: Connection timed out.

To my understanding, the pod's IP should be reachable from all nodes. Is that correct?
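For reference, the two address ranges involved here are distinct (assuming the MicroK8s defaults: 10.152.183.0/24 for service ClusterIPs, 10.1.0.0/16 for Calico pod IPs). A tiny helper, purely hypothetical and for illustration, to keep them apart when reading the wget output:

```shell
# classify (hypothetical helper): sort an address into the MicroK8s default ranges
#   service ClusterIP range: 10.152.183.0/24 (handled by kube-proxy rules)
#   Calico pod range:        10.1.0.0/16    (routed over the Calico overlay)
classify() {
  case "$1" in
    10.152.183.*) echo "service" ;;
    10.1.*)       echo "pod" ;;
    *)            echo "node/other" ;;
  esac
}

classify 10.152.183.247   # the test-service ClusterIP -> service
classify 10.1.50.194      # the calico-kube-controllers pod -> pod
```

So the first wget (to 10.152.183.1) exercises the service path, while this one (to 10.1.50.194) exercises pod routing over the overlay.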


problem 2

I installed the following deployment and service by calling

kubectl apply -f ./deployment.yaml
kubectl apply -f ./service.yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test-deployment
  name: test-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-deployment
  template:
    metadata:
      labels:
        app: test-deployment
    spec:
      containers:
      - image: dontrebootme/microbot:v1
        imagePullPolicy: IfNotPresent
        name: microbot
        resources: {}
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
# service.yaml
apiVersion: v1 
kind: Service 
metadata:
  name: test-service 
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: test-deployment
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80

kubectl get all --all-namespaces

NAMESPACE     NAME                                          READY   STATUS    RESTARTS      AGE
kube-system   pod/calico-node-hwsvj                         1/1     Running   1 (91m ago)   101m
kube-system   pod/calico-node-zd6rc                         1/1     Running   1 (91m ago)   100m
kube-system   pod/calico-node-djkmk                         1/1     Running   1 (91m ago)   101m
kube-system   pod/calico-kube-controllers-dc44f6cdf-flj54   1/1     Running   2 (91m ago)   103m
default       pod/test-deployment-5899c5ff7d-d442g          1/1     Running   0             59s

NAMESPACE   NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
default     service/kubernetes     ClusterIP   10.152.183.1     <none>        443/TCP   103m
default     service/test-service   ClusterIP   10.152.183.247   <none>        80/TCP    31s

NAMESPACE     NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/calico-node   3         3         3       3            3           kubernetes.io/os=linux   103m

NAMESPACE     NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/calico-kube-controllers   1/1     1            1           103m
default       deployment.apps/test-deployment           1/1     1            1           59s

NAMESPACE     NAME                                                DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/calico-kube-controllers-dc44f6cdf   1         1         1       103m
default       replicaset.apps/test-deployment-5899c5ff7d          1         1         1       59s

Calling wget http://10.152.183.247/ on all three nodes returns, on two of them,

--2022-05-06 10:34:04--  http://10.152.183.247/
Connecting to 10.152.183.247:80... failed: Connection timed out.
Retrying.

and on one of them

<!DOCTYPE html>
<html>
  <style type="text/css">
    .centered
      {
      text-align:center;
      margin-top:0px;
      margin-bottom:0px;
      padding:0px;
      }
  </style>
  <body>
    <p class="centered"><img src="microbot.png" alt="microbot"/></p>
    <p class="centered">Container hostname: test-deployment-5899c5ff7d-d442g</p>
  </body>
</html>

To my understanding, the service should be reachable from all nodes.
Calling wget on the IP of the pod itself shows exactly the same behavior.
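One way to narrow this down (a sketch, assuming the scheduler spreads the replicas across the nodes): scale to three replicas so each node likely hosts one pod, then probe the service from every node and look at the reported container hostname. If each node only ever gets an answer from its local pod, cross-node traffic is being dropped somewhere below Kubernetes.

```shell
# Sketch: spread the replicas, then probe the service from each node in turn.
microk8s kubectl scale deployment/test-deployment --replicas=3
microk8s kubectl get pods -o wide    # note which pod landed on which node

# Run on every node; the "Container hostname" line names the answering pod.
wget -qO- http://10.152.183.247/ | grep "Container hostname"
```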


workaround

Adding hostNetwork: true to the deployment makes the service accessible from all nodes, but that seems like the wrong way to do it.
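Since hostNetwork: true bypasses the Calico overlay entirely, the fact that it helps suggests the encapsulated traffic between the nodes is being dropped. A common cause is a host firewall; a sketch of rules to try (assumes ufw and Calico's VXLAN encapsulation on its standard UDP port 4789; the subnet placeholder needs the real node network):

```shell
# Sketch (assumes ufw): let Calico's overlay traffic through on every node.
sudo ufw allow in on vxlan.calico
sudo ufw allow out on vxlan.calico
# Calico's VXLAN encapsulation between nodes runs over UDP 4789.
sudo ufw allow proto udp from <your-node-subnet> to any port 4789
```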


Does anyone have an idea how I can debug this? I am out of ideas.
