
Communication issue with VXLAN if traffic needs to be redirected to the node where the pod is running #10149

Closed
joseluisgonzalezca opened this issue Apr 4, 2025 · 7 comments


@joseluisgonzalezca

Expected Behavior

According to the Kubernetes design, NodePort services must be "available" on all nodes of the cluster, regardless of whether the target pods of the service are running on those nodes. When external traffic enters the cluster through a loadbalancer, requests are distributed among the different nodes. If the node that receives a request does not host the required pod, it must be able to forward the traffic to a node that does.
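For context, the Service involved is roughly of this shape (a minimal sketch with placeholder names, not the exact manifest from our cluster). Note that externalTrafficPolicy defaults to Cluster, which is exactly the case where the receiving node must forward traffic to a pod on another node:

apiVersion: v1
kind: Service
metadata:
  name: example-service            # placeholder name, not our real service
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster   # the default; traffic may be forwarded to pods on other nodes
  selector:
    app: example                   # placeholder selector
  ports:
  - port: 443
    targetPort: 443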

Current Behavior

A service exposed externally only works if the target pod is running on the same node the original request was sent to. In other cases, a timeout occurs and the workload cannot be accessed.

Possible Solution

Other issues like #9433, #9985 or #8860 suggest disabling checksum verification. Some people confirmed that this workaround was sufficient to resolve their problem; for us, it was not. It is possible that our issue comes from another source and is not directly related, but the symptoms are very similar to those recent issues.

Steps to Reproduce (for bugs)

  1. Create a Kubernetes cluster with multiple nodes
  2. Install Calico 3.29.1 through operator
  3. Create a simple deployment and expose it via a LoadBalancer/NodePort service. In my case, I deployed ingress-nginx-controller with its Helm chart to simplify the process.
  4. Try to access the service by making curl requests to the LoadBalancer IP. Some of the requests will succeed, while others will fail with a timeout or connection reset (a simple loop like the sketch below is enough to see this).
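A minimal loop to observe the intermittent failures (a sketch; <LB_IP> is a placeholder for the loadbalancer address):

for i in $(seq 1 20); do
  # -m 5: give up after 5 seconds; -k: the ingress uses a self-signed certificate
  curl -k -s -o /dev/null -m 5 -w "%{http_code}\n" https://<LB_IP>/ || echo "timeout/reset"
done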

Context

We have a test environment that is practically equivalent to the one that is causing us problems. There are only two main differences:

  • Some of the test cluster nodes have been upgraded from Debian 11 to Debian 12 (not fresh install).
  • The production cluster has some NetworkPolicy objects, but they are bound to other namespaces, so they should not affect this scenario. If required, I can provide them.

The procedures performed on both clusters are the same to make the comparison as accurate as possible in order to determine the real source of the problem.

MTU

First of all, there was an MTU configuration problem. Interfaces were not being correctly auto-detected because of the mtuIfacePattern default value. I have changed the configuration to match the Debian 12 interface name pattern (see #10148) and the interfaces are now detected as expected. Here are the logs of one of the calico nodes:

calico-node 2025-04-03 13:53:24.352 [INFO][9] startup/startup.go 444: Checking datastore connection
calico-node 2025-04-03 13:53:24.356 [INFO][9] startup/startup.go 468: Datastore connection verified
calico-node 2025-04-03 13:53:24.357 [INFO][9] startup/startup.go 105: Datastore is ready
calico-node 2025-04-03 13:53:24.363 [WARNING][9] startup/winutils.go 150: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
calico-node 2025-04-03 13:53:24.370 [INFO][9] startup/autodetection_methods.go 103: Using autodetected IPv4 address on interface enX2: 172.31.6.37/24
calico-node 2025-04-03 13:53:24.370 [INFO][9] startup/startup.go 720: No AS number configured on node resource, using global value
calico-node 2025-04-03 13:53:24.370 [INFO][9] startup/startup.go 755: Skipping IP pool configuration
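For completeness, the change is essentially a patch of the default FelixConfiguration along these lines (the pattern shown here is only illustrative, adapted to the enX* naming; see #10148 for the actual discussion):

kubectl patch felixconfiguration default --type merge -p '{"spec":{"mtuIfacePattern":"^((en|wl|ww|sl|ib).*|(eth|wlan|wwan).*)"}}'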

Checksum verification

As stated in other issues, there is a kernel bug where checksums are not generated correctly for packets, causing problems with their validation. By running a basic tcpdump with the -vv flag, we can see that the errors are present:

10:13:58.257468 enX1  Out IP (tos 0x0, ttl 64, id 3548, offset 0, flags [DF], proto TCP (6), length 76)
    192.168.132.11.9345 > 192.168.132.13.53972: Flags [P.], cksum 0x89a8 (incorrect -> 0x8c4c), seq 1:25, ack 28, win 6373, options [nop,nop,TS val 3372686346 ecr 3893990312], length 24
10:13:58.257601 enX1  In  IP (tos 0x0, ttl 64, id 64572, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.132.13.53972 > 192.168.132.11.9345: Flags [.], cksum 0x8990 (incorrect -> 0xf4b9), seq 28, ack 25, win 1593, options [nop,nop,TS val 3893990312 ecr 3372686346], length 0

As a workaround, some people suggested disabling it. I have done this on all my nodes by executing the following command for the node interfaces (public interface enX0, private interface enX1 for inter-node communication, and vxlan.calico):

ethtool -K enX0 tx off rx off && ethtool -K enX1 tx off rx off && ethtool -K vxlan.calico tx off rx off
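For anyone reproducing this, the resulting offload state can be confirmed afterwards with ethtool -k (a quick sanity check):

# should report "off" for both features on each interface
ethtool -k enX1 | grep -E 'tx-checksumming|rx-checksumming'
ethtool -k vxlan.calico | grep -E 'tx-checksumming|rx-checksumming'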

However, the problem persists.

Iptables configuration

As we all know, traffic to services is routed via rules that kube-proxy creates in iptables. Because of that, it's essential that the service is correctly configured and that all rules are present. Here are the rules created on one of my nodes:
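For reference, the listings below come from the nat table; a command along these lines reproduces the same column layout:

iptables -t nat -L -n -v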

PREROUTING chain
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
6258K  470M cali-PREROUTING  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:6gwbT8clXdHdC1b1 */
6258K  470M KUBE-SERVICES  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
1245K   75M DOCKER     0    --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
1035K   62M CNI-HOSTPORT-DNAT  0    --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
KUBE-SERVICES chain
Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination

<< TRUNCATED >>

    0     0 KUBE-SVC-RXZQBFX6IWO22WWW  6    --  *      *       0.0.0.0/0            10.43.227.77         /* cattle-system/cattle-cluster-agent:http cluster IP */ tcp dpt:80
    0     0 KUBE-SVC-DISNXZXWEI7GIGLU  6    --  *      *       0.0.0.0/0            10.43.227.77         /* cattle-system/cattle-cluster-agent:https-internal cluster IP */ tcp dpt:443
    0     0 KUBE-SVC-EZYNCFY2F7N6OQA2  6    --  *      *       0.0.0.0/0            10.43.46.84          /* ingress-nginx/ingress-nginx-controller-admission:https-webhook cluster IP */ tcp dpt:443
    0     0 KUBE-SVC-RK657RLKDNVNU64O  6    --  *      *       0.0.0.0/0            10.43.28.198         /* calico-system/calico-typha:calico-typha cluster IP */ tcp dpt:5473
    0     0 KUBE-SVC-PUNXDRXNIM3ELMDM  6    --  *      *       0.0.0.0/0            10.43.0.10           /* kube-system/rke2-coredns-rke2-coredns:tcp-53 cluster IP */ tcp dpt:53
    0     0 KUBE-SVC-5IVUE5OK3QWPEUWR  6    --  *      *       0.0.0.0/0            10.43.184.88         /* kube-system/csi-rbdplugin-provisioner:http-metrics cluster IP */ tcp dpt:8080
    0     0 KUBE-SVC-MTMD5TM6ML5IOAJF  6    --  *      *       0.0.0.0/0            10.43.155.164        /* satse-dev-liferays-els/elasticsearch-v7176:http cluster IP */ tcp dpt:9200
    0     0 KUBE-SVC-VOOBF6UTRBMKQJOO  6    --  *      *       0.0.0.0/0            10.43.230.222        /* ingress-nginx/ingress-nginx-controller-metrics:metrics cluster IP */ tcp dpt:10254
    0     0 KUBE-SVC-EDNDUDH2C75GIR6O  6    --  *      *       0.0.0.0/0            10.43.46.167         /* ingress-nginx/ingress-nginx-controller:https cluster IP */ tcp dpt:443
    0     0 KUBE-EXT-EDNDUDH2C75GIR6O  6    --  *      *       0.0.0.0/0            <REDACTED>        /* ingress-nginx/ingress-nginx-controller:https loadbalancer IP */ tcp dpt:443
11357  681K KUBE-NODEPORTS  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
KUBE-NODEPORTS chain
Chain KUBE-NODEPORTS (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-EXT-EDNDUDH2C75GIR6O  6    --  *      *       0.0.0.0/0            127.0.0.0/8          /* ingress-nginx/ingress-nginx-controller:https */ tcp dpt:31526 nfacct-name  localhost_nps_accepted_pkts
  349 20940 KUBE-EXT-EDNDUDH2C75GIR6O  6    --  *      *       0.0.0.0/0            0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:https */ tcp dpt:31526
    0     0 KUBE-EXT-CG5I4G2RS3ZVWGLK  6    --  *      *       0.0.0.0/0            127.0.0.0/8          /* ingress-nginx/ingress-nginx-controller:http */ tcp dpt:31850 nfacct-name  localhost_nps_accepted_pkts
  356 21360 KUBE-EXT-CG5I4G2RS3ZVWGLK  6    --  *      *       0.0.0.0/0            0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:http */ tcp dpt:31850
KUBE-EXT-EDNDUDH2C75GIR6O chain
Chain KUBE-EXT-EDNDUDH2C75GIR6O (3 references)
 pkts bytes target     prot opt in     out     source               destination
 311K   19M KUBE-MARK-MASQ  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* masquerade traffic for ingress-nginx/ingress-nginx-controller:https external destinations */
 311K   19M KUBE-SVC-EDNDUDH2C75GIR6O  0    --  *      *       0.0.0.0/0            0.0.0.0/0
KUBE-SVC-EDNDUDH2C75GIR6O chain
Chain KUBE-SVC-EDNDUDH2C75GIR6O (2 references)
 pkts bytes target     prot opt in     out     source               destination
    3   180 KUBE-MARK-MASQ  6    --  *      *      !10.42.0.0/16         10.43.46.167         /* ingress-nginx/ingress-nginx-controller:https cluster IP */ tcp dpt:443
44576 2675K KUBE-SEP-BLCLVRWVHOKIRQOA  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:https -> 10.42.108.238:443 */ statistic mode random probability 0.14285714272
44564 2674K KUBE-SEP-ZLRUIONMKUZN5XYX  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:https -> 10.42.187.149:443 */ statistic mode random probability 0.16666666651
44722 2683K KUBE-SEP-7HYW7TC66TTC4C6S  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:https -> 10.42.188.108:443 */ statistic mode random probability 0.20000000019
44231 2654K KUBE-SEP-SSMDSV6FPXXFJJ64  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:https -> 10.42.251.205:443 */ statistic mode random probability 0.25000000000
44490 2669K KUBE-SEP-XQQKLYDYOERHD5NO  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:https -> 10.42.33.211:443 */ statistic mode random probability 0.33333333349
43998 2640K KUBE-SEP-UNJAAAXIAN2HFT2Y  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:https -> 10.42.37.24:443 */ statistic mode random probability 0.50000000000
44786 2687K KUBE-SEP-7AH4VAWXVKCTZHTE  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:https -> 10.42.72.104:443 */
Specific pod chain (the others are equivalent)
Chain KUBE-SEP-UNJAAAXIAN2HFT2Y (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-MARK-MASQ  0    --  *      *       10.42.37.24          0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:https */
44004 2640K DNAT       6    --  *      *       0.0.0.0/0            0.0.0.0/0            /* ingress-nginx/ingress-nginx-controller:https */ tcp to:10.42.37.24:443
OUTPUT and POSTROUTING
Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
8998K  571M cali-OUTPUT  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:tVnHkvAo15HuiPy0 */
8999K  571M KUBE-SERVICES  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
   39  2436 DOCKER     0    --  *      *       0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL
4269K  256M CNI-HOSTPORT-DNAT  0    --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
  12M  825M CNI-HOSTPORT-MASQ  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* CNI portfwd requiring masquerade */
  13M  921M KUBE-POSTROUTING  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
    0     0 MASQUERADE  0    --  *      !docker0  10.10.0.0/24         0.0.0.0/0
  12M  860M cali-POSTROUTING  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:0i8pjzKKPyA34aQD */
Other output and postrouting related chains
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
20361 1609K RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
  370 22200 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
  370 22200 MASQUERADE  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

Chain cali-OUTPUT (1 references)
 pkts bytes target     prot opt in     out     source               destination
8998K  571M cali-fip-dnat  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:GBTAv2p5CwevEyJm */

Chain cali-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
  12M  860M cali-fip-snat  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:Z-c7XtVd2Bq7s_hA */
  12M  860M cali-nat-outgoing  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:nYKhEzDlr11Jccal */
    0     0 MASQUERADE  0    --  *      vxlan.calico  0.0.0.0/0            0.0.0.0/0            /* cali:e9dnSgSVNmIcpVhP */ ADDRTYPE match src-type !LOCAL limit-out ADDRTYPE match src-type LOCAL random-fully

Chain cali-nat-outgoing (1 references)
 pkts bytes target     prot opt in     out     source               destination
 334K   22M MASQUERADE  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst random-fully

As you can see, everything looks normal. I have compared these rules with the testing environment, which is currently working, and they are the same.

IPs and routes of one of the nodes

There are routes to the other nodes across the VXLAN tunnels, so I think this configuration is also correct. The only strange thing is that the vxlan.calico interface is in the UNKNOWN state. Is this normal?

# ip route
default via 172.18.1.1 dev enX0
10.10.0.0/24 dev docker0 proto kernel scope link src 10.10.0.1 linkdown
10.42.33.192/26 via 10.42.33.192 dev vxlan.calico onlink
blackhole 10.42.37.0/26 proto 80
10.42.37.2 dev cali19e2a7b0ea5 scope link
10.42.37.5 dev calic6b6fe002a0 scope link
10.42.37.9 dev cali71d30acceb5 scope link
10.42.37.11 dev calibd4771adc45 scope link
10.42.37.14 dev cali14f95abeeaf scope link
10.42.37.15 dev cali80acc557fa1 scope link
10.42.37.16 dev calid99b3958f38 scope link
10.42.37.19 dev cali717c5e72bad scope link
10.42.37.20 dev cali9426989e871 scope link
10.42.37.24 dev cali199ad01a428 scope link
10.42.72.64/26 via 10.42.72.64 dev vxlan.calico onlink
10.42.108.192/26 via 10.42.108.192 dev vxlan.calico onlink
10.42.187.128/26 via 10.42.187.128 dev vxlan.calico onlink
10.42.188.64/26 via 10.42.188.64 dev vxlan.calico onlink
10.42.251.192/26 via 10.42.251.192 dev vxlan.calico onlink
172.18.1.0/24 dev enX0 proto kernel scope link src 172.18.1.11
172.31.6.0/24 dev enX2 proto kernel scope link src 172.31.6.32
192.168.132.0/24 dev enX1 proto kernel scope link src 192.168.132.11
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enX0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether e6:25:53:5b:03:be brd ff:ff:ff:ff:ff:ff
    inet 172.18.1.11/24 brd 172.18.1.255 scope global dynamic enX0
       valid_lft 19042sec preferred_lft 19042sec
3: enX1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ee:96:46:86:ca:e6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.132.11/24 brd 192.168.132.255 scope global dynamic enX1
       valid_lft 19005sec preferred_lft 19005sec
4: enX2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 5e:05:b9:e1:ae:10 brd ff:ff:ff:ff:ff:ff
    inet 172.31.6.32/24 brd 172.31.6.255 scope global dynamic enX2
       valid_lft 18971sec preferred_lft 18971sec
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:d1:5e:a6:ac brd ff:ff:ff:ff:ff:ff
    inet 10.10.0.1/24 brd 10.10.0.255 scope global docker0
       valid_lft forever preferred_lft forever
8: cali19e2a7b0ea5@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-fa813f6f-a12f-854b-ef03-c796dccdc712
11: calic6b6fe002a0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-ec76a588-a4cb-6d28-059b-1d724ad4692e
18: cali71d30acceb5@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-a8fac0de-17f7-1490-b3aa-798364e94cc2
20: calibd4771adc45@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-6dc2a559-ce9e-c528-0bda-eeba0006c885
23: cali14f95abeeaf@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-3f7964e3-868e-d960-840b-9df98639d404
24: cali80acc557fa1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-ec7e1c75-ba2f-7412-edd7-a5d10cc071b9
25: calid99b3958f38@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-4dcec8d8-a6f7-0f87-dbdd-0911eb11851c
28: cali717c5e72bad@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-e8282c46-68a4-2e25-7018-f93fec6d84de
29: cali9426989e871@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-ac6e03eb-edf3-bbd2-faa2-e24527f2fdfa
33: cali199ad01a428@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-fd26e75c-4375-88bf-fe42-b13b461aa1c4
49: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 66:b4:ff:a2:15:db brd ff:ff:ff:ff:ff:ff
    inet 10.42.37.0/32 scope global vxlan.calico
       valid_lft forever preferred_lft forever

Conntrack status

This is the only point where we have detected different behavior between the faulty cluster and the test cluster. In the case of the test cluster (the working one), we can see that the NAT translation is done with the IP of the VXLAN interface:

ipv4     2 tcp      6 86389 ESTABLISHED src=192.168.79.5 dst=192.168.79.6 sport=51486 dport=31466 src=10.42.94.78 dst=10.42.77.192 sport=443 dport=19428 [ASSURED] mark=0 zone=0 use=2

However, in the case of the faulty cluster, the translation is done with the public interface IP:

ipv4     2 tcp      6 86396 ESTABLISHED src=192.168.132.102 dst=192.168.132.11 sport=42576 dport=31526 src=10.42.37.24 dst=172.18.1.11 sport=443 dport=47270 [ASSURED] mark=0 zone=0 use=2
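Both entries above were pulled from the connection tracking table; filtering on the NodePort of the service makes the comparison quick (assuming the nf_conntrack proc interface is available):

grep 31526 /proc/net/nf_conntrack   # use the NodePort of the service on each cluster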

Your Environment

  • Calico version: 3.29.1
  • Calico dataplane: iptables
  • Orchestrator version: Kubernetes v1.31.5 provided through RKE2 distribution
  • Operating System and version: All nodes are running Debian 12 with Linux kernel 6.1.124-1
  • Node interfaces: enX0 as public interface, enX1 as internal interface, enX2 as storage network interface

If you need any more information about our environment, we are happy to provide it. I hope this context is sufficient to help find the issue with our cluster.

Thanks in advance for your time and help!

@tomastigera
Contributor

What is your vxlan setting? Do you use the default VXLANCrossSubnet?

Does plain pod-pod communication within your cluster work across vxlan? If not, do you observe, as in some of the other linked cases, that ICMP makes it through?

I am a little puzzled by the working vs non-working cluster. Are you saying that you have pretty much identical clusters and one works well and the other does not? Are you sure that in the working cluster vxlan is on the path, that is, that forwarding happens between nodes in different subnets, in case you use the default VXLANCrossSubnet?

10:13:58.257468 enX1  Out IP (tos 0x0, ttl 64, id 3548, offset 0, flags [DF], proto TCP (6), length 76)
    192.168.132.11.9345 > 192.168.132.13.53972: Flags [P.], cksum 0x89a8 (incorrect -> 0x8c4c), seq 1:25, ack 28, win 6373, options [nop,nop,TS val 3372686346 ecr 3893990312], length 24

The fact that the (udp) csum is incorrect on egress is expected, as it is meant to be fixed by the offload on the device. Are you able to see the packet at its destination node? Does it still have a wrong csum? If yes, then the packet would be dropped.

However, in the case of the faulty cluster, the translation is done with the public interface IP:

ipv4 2 tcp 6 86396 ESTABLISHED src=192.168.132.102 dst=192.168.132.11 sport=42576 dport=31526 src=10.42.37.24 dst=172.18.1.11 sport=443 dport=47270 [ASSURED] mark=0 zone=0 use=2

That seems like in one case the forwarding goes via vxlan and in the other one it does not.

Try to use tcpdump on -i any, or on enX1 and vxlan.calico, to see which route the packets take. Try to find the first place where you would expect to see the packet but do not.
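Something along these lines (just a sketch, adjust the ports to your service):

tcpdump -ni vxlan.calico tcp port 443   # is the forwarded traffic entering the tunnel?
tcpdump -ni enX1 udp port 4789          # are encapsulated vxlan packets leaving the node?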

@tomastigera
Contributor

Could you also provide ip -d link show dev vxlan.calico? It should show that the source is the enX1 IP. With this many devices, the vxlan one may accidentally be associated with the wrong one.

@joseluisgonzalezca
Author

joseluisgonzalezca commented Apr 7, 2025

Hi @tomastigera!

Thanks for your quick response. I will provide some more info about our setup in case it might help.

What is your vxlan setting? Do you use the default VXLANCrossSubnet?

For the IPPool resource configuration we are using the default values provided by Rancher. This is the current definition:

apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  creationTimestamp: "2025-03-06T14:38:13Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: tigera-operator
  name: default-ipv4-ippool
  resourceVersion: "869"
  uid: fdf33413-2adf-4ef7-98c5-a11213ed988a
spec:
  allowedUses:
  - Workload
  - Tunnel
  blockSize: 26
  cidr: 10.42.0.0/16
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Always

Does plain pod-pod communication within your cluster work across vxlan? If not, do you observe, as in some of the other linked cases, that ICMP makes it through?

Yeah! Inter-pod communication is working as expected. For example, I have these replicas for my nginx-ingress app:

% kubectl get pods -A -o wide | grep ingress
ingress-nginx                ingress-nginx-controller-4bs6n                          1/1     Running     0               11d     10.42.72.104     hou9856623   <none>           <none>
ingress-nginx                ingress-nginx-controller-4zqhj                          1/1     Running     0               11d     10.42.33.211     hou7173026   <none>           <none>
ingress-nginx                ingress-nginx-controller-6t42g                          1/1     Running     0               11d     10.42.251.205    hou3195987   <none>           <none>

I can enter one of those pods and ping/curl the others running on other nodes without problems:

% kubectl exec -it ingress-nginx-controller-4zqhj -n ingress-nginx -- sh
/etc/nginx $ ping 10.42.72.104
PING 10.42.72.104 (10.42.72.104): 56 data bytes
64 bytes from 10.42.72.104: seq=0 ttl=42 time=0.422 ms
64 bytes from 10.42.72.104: seq=1 ttl=42 time=0.301 ms
64 bytes from 10.42.72.104: seq=2 ttl=42 time=0.323 ms
^C
--- 10.42.72.104 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.301/0.348/0.422 ms
/etc/nginx $ curl https://10.42.72.104
curl: (60) SSL certificate problem: self-signed certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the webpage mentioned above.
/etc/nginx $

I am a little puzzled by the working vs non-working cluster. Are you saying that you have pretty much identical clusters and one works well and the other does not? Are you sure that in the working cluster vxlan is on the path, that is, that forwarding happens between nodes in different subnets, in case you use the default VXLANCrossSubnet?

Yes. What I was referring to is that we have two similar clusters. The versions of K8s, Calico and the underlying OS are the same. The Calico deployment provided by Rancher is unmodified in both cases. In the test environment, all services work as expected. In the other one, traffic does not work correctly in some cases. That's what bothers me the most...

Could you also provide ip -d link show dev vxlan.calico? It should show that the source is the enX1 IP. With this many devices, the vxlan one may accidentally be associated with the wrong one.

I can see that something is wrong here. If I execute the command, I see that the virtual interface is bound to enX2 (the last one, used for the storage network). It should be enX1!

# ip -d link show dev vxlan.calico
49: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 66:b4:ff:a2:15:db brd ff:ff:ff:ff:ff:ff promiscuity 0  allmulti 0 minmtu 68 maxmtu 65535
    vxlan id 4096 local 172.31.6.32 dev enX2 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536

# ip a
<TRUNCATED>
       valid_lft forever preferred_lft forever
2: enX0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether e6:25:53:5b:03:be brd ff:ff:ff:ff:ff:ff
    inet 172.18.1.11/24 brd 172.18.1.255 scope global dynamic enX0
       valid_lft 70442sec preferred_lft 70442sec
3: enX1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ee:96:46:86:ca:e6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.132.11/24 brd 192.168.132.255 scope global dynamic enX1
       valid_lft 70413sec preferred_lft 70413sec
4: enX2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 5e:05:b9:e1:ae:10 brd ff:ff:ff:ff:ff:ff
    inet 172.31.6.32/24 brd 172.31.6.255 scope global dynamic enX2
       valid_lft 70397sec preferred_lft 70397sec

Is there a way to force Calico to use a specific interface?

That seems like in one case the forwarding goes via vxlan and in the other one it does not.

Is this behaviour normal? I understand that, with this Calico configuration, everything should be routed through the vxlan interface. Correct me if I'm wrong.

Try to use tcpdump on -i any, or on enX1 and vxlan.calico, to see which route the packets take. Try to find the first place where you would expect to see the packet but do not.

As you can see in the following screenshot, the first 3 TCP SYN packets are received from the LB but they are not routed anywhere. However, the 4th one is acknowledged correctly because, "luckily", it's routed to a pod (10.42.37.24) that is running on the node that received the request.
[Screenshot: tcpdump capture of the SYN packets]

What I don't understand is why it's using the IP of the first interface (enX0) as the source. In the other cases, I cannot see any additional traffic apart from the original SYN packets received from the LB.

BTW, this is the command that I have used to sniff the traffic (removing traffic generated by kubernetes/rancher to obtain a cleaner output):

tcpdump -i any not port 22 and not port 6443 and not port 2379 and not port 2380 and not port 9345 and not port 10250 and not port 53 and not port 2381 -w calico-debug.pcap

PCAP file => calico-debug.pcap.zip
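When reading the capture back, filtering on the VXLAN UDP port shows whether the SYNs ever leave the node encapsulated (sketch):

tcpdump -r calico-debug.pcap -n udp port 4789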

Thanks for your help :)

@joseluisgonzalezca
Author

More debug information. In the case of the working cluster, the FORWARD chain looks like this:

Chain FORWARD (policy DROP 354 packets, 44842 bytes)
 pkts bytes target     prot opt in     out     source               destination
 113M   42G cali-FORWARD  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:wUHhoiAYhphO9Mso */
6462K  389M KUBE-PROXY-FIREWALL  0    --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate NEW /* kubernetes load balancer firewall */
7777K  460M KUBE-FORWARD  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */
6115K  369M KUBE-SERVICES  0    --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate NEW /* kubernetes service portals */
6115K  369M KUBE-EXTERNAL-SERVICES  0    --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate NEW /* kubernetes externally-visible service portals */
6115K  369M DOCKER-USER  0    --  *      *       0.0.0.0/0            0.0.0.0/0
6115K  369M DOCKER-ISOLATION-STAGE-1  0    --  *      *       0.0.0.0/0            0.0.0.0/0
    0     0 ACCEPT     0    --  *      docker0  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
    0     0 DOCKER     0    --  *      docker0  0.0.0.0/0            0.0.0.0/0
    0     0 ACCEPT     0    --  docker0 !docker0  0.0.0.0/0            0.0.0.0/0
    0     0 ACCEPT     0    --  docker0 docker0  0.0.0.0/0            0.0.0.0/0
6115K  369M ACCEPT     0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:S93hcgKJrXEqnTfs */ /* Policy explicitly accepted packet. */ mark match 0x10000/0x10000
  354 44842 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:mp77cMpurHhyjLrM */ MARK or 0x100

However, the other one looks "emptier":

Chain FORWARD (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
  93M   46G cali-FORWARD  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:wUHhoiAYhphO9Mso */
2747K  276M ACCEPT     0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:S93hcgKJrXEqnTfs */ /* Policy explicitly accepted packet. */ mark match 0x10000/0x10000
 841K   50M MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:mp77cMpurHhyjLrM */ MARK or 0x10000

Could this be the cause?

@tomastigera
Contributor

There seem to be missing kube-proxy chains in the non-working cluster. Can you reach pod-pod via a service? Does your kube-proxy work?

You should see each SYN that is being forwarded twice (perhaps with a NATed source): inbound via enX0 and outbound via vxlan.calico, like you can see with the successful connection to the local pod (outbound via the pod's iface). It does not seem like the packets get forwarded to any pod.

I can see that something is wrong here. If I execute the command, I see that the virtual interface is bound to enX2 (the last one, used for the storage network). It should be enX1!

Yes, it should be enX1. However, it does not seem to be an issue for pod-pod over vxlan.

If you were tcpdumping for udp port 4789 (vxlan), you would likely see vxlan packets leaving over enX1 with the enX2 source IP. They are likely being dropped at the destination node because of the unexpected source IP.
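A quick way to confirm which device and source IP the encapsulated packets actually use (sketch):

tcpdump -ni any udp port 4789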

Check which IP/device is selected and how the IP autodetection method is configured.

calicoctl get node <node-name> -o yaml should give something like

apiVersion: projectcalico.org/v3
kind: Node
metadata:
  name: node-hostname
spec:
  bgp:
    asNumber: 64512
    ipv4Address: 10.244.0.1/24
    ipv6Address: 2000:db8:85a3::8a2e:370:7335/120
    ipv4IPIPTunnelAddr: 192.168.0.1

@joseluisgonzalezca
Author

Hi again @tomastigera!

I was finally able to make it work! The interface binding was indeed not correct: it was using enX2 instead of enX1. I have modified the Helm chart configuration by setting the interface attribute of the CalicoNetworkSpec object.
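Roughly, the relevant part looks like this (field names follow the operator's Installation CRD; the exact values path in the RKE2 Helm chart may differ slightly, so treat this as a sketch):

installation:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      interface: enX1   # force IPv4 autodetection onto the inter-node interface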

After doing that, I executed ip -d link show dev vxlan.calico on the nodes, and all of them now show the enX1 interface.

Regarding the iptables rules issue, I believe there was a moment during node provisioning when we accidentally deleted some of the rules. I thought that restarting kube-proxy would recreate them automatically, but it seems a full resync of the rules doesn't happen unless you either restart the node or delete certain rules (see kubernetes/kubernetes#129128 (comment)). In short, I restarted the nodes and the kube-proxy rules reappeared. After that, everything started working correctly.
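For anyone hitting the same symptom, a quick way to spot missing kube-proxy chains is to check for KUBE-* references in the relevant chains (simple sketch):

iptables -L FORWARD -n -v | grep KUBE-
iptables -t nat -L PREROUTING -n -v | grep KUBE-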

I want to sincerely thank you for your help. It was really useful for debugging the issue and for learning a bit more about the inner workings of Calico.

You can close the issue :)

@tomastigera
Contributor

Thanks for reporting back and great that it works for you! 🎉
