Communication issue with VXLAN if traffic needs to be redirected to the node where the pod is running #10149
What is your vxlan setting? Do you use the default VXLANCrossSubnet? Does plain pod-to-pod traffic within your cluster work across vxlan? If not, do you observe, as in some of the other linked cases, that ICMP makes it through? I am a little puzzled by the working vs. non-working cluster: are you saying that you have pretty much identical clusters and one works well while the other does not? Are you sure that in the working cluster vxlan is on the path, that is, that forwarding happens between nodes in different subnets if you use the default VXLANCrossSubnet?
The fact that the (udp) csum is incorrect on egress is expected, as it is meant to be fixed by the offload to the device. Are you able to see the packet at its destination node? Does it still have a wrong csum? If yes, then the packet would be dropped.
That seems like in one case the forwarding goes via vxlan and in the other one it does not. Try to use tcpdump on
Could you also provide
Hi @tomastigera! Thanks for your quick response. I will provide some more info about our setup in case it might help.
For the IPPool resource configuration we are using the default values provided by Rancher. This is the current definition:

```yaml
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  creationTimestamp: "2025-03-06T14:38:13Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: tigera-operator
  name: default-ipv4-ippool
  resourceVersion: "869"
  uid: fdf33413-2adf-4ef7-98c5-a11213ed988a
spec:
  allowedUses:
  - Workload
  - Tunnel
  blockSize: 26
  cidr: 10.42.0.0/16
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Always
```
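For completeness, the effective encapsulation mode of the pools can also be checked directly with calicoctl (standard commands, nothing specific to our setup):

```bash
# List IP pools with their encapsulation settings (the VXLANMODE column).
calicoctl get ippool -o wide

# Or dump the single pool shown above in full.
calicoctl get ippool default-ipv4-ippool -o yaml
```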
Yeah! Inter-pod communication is working as expected. For example, I have these replicas for my nginx-ingress app:
I can enter one of those pods and ping/curl the others running on other nodes without problems:
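For reference, this is roughly how that check looks (the namespace and pod names below are placeholders, not the ones from our cluster):

```bash
# Note the pod IPs and the nodes each replica runs on.
kubectl -n ingress-nginx get pods -o wide

# From one replica, reach another replica that runs on a different node.
kubectl -n ingress-nginx exec -it <pod-a> -- ping -c 3 <pod-b-ip>
kubectl -n ingress-nginx exec -it <pod-a> -- curl -sv http://<pod-b-ip>:80/
```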
Yes. What I was referring to is that we have two similar clusters. The versions of K8s, Calico and the underlying OS are the same, and the Calico deployment provided by Rancher is not modified in either case. In the test environment, all services are working as expected. In the other one, the traffic is not working correctly in some cases. That's what bothers me the most...
I can see that something is wrong here. If I execute the command, I see that the virtual interface is bound to enX2 (the last one, used for the storage network). It should be enX1!
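For context, a command along these lines shows which underlying device/address the tunnel is using (assuming the standard vxlan.calico interface name):

```bash
# The "local <ip>" / "dev <iface>" part of the detailed output shows which
# address/device the VXLAN tunnel is bound to.
ip -d link show vxlan.calico
```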
Is there a way to force Calico to use a specific interface?
Is this behaviour normal? I understand that, with this configuration of Calico, all traffic should be routed through the vxlan interface. Correct me if I'm wrong.
As you can see in the following screenshot, the first 3 TCP SYN packets are received from the LB but they are not routed anywhere. However, the 4th one is acknowledged correctly because, "luckily", it is routed to a pod (10.42.37.24) that is running on the node that received the request. What I don't understand is why it is using the IP of the first interface (enX0) as the source. In the other cases, I cannot see any additional traffic apart from the original SYN packets received from the LB. BTW, this is the command that I have used to sniff the traffic (filtering out traffic generated by kubernetes/rancher to obtain a cleaner output):
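A capture along these lines produces a similar trace; the exact filter from the original comment is not reproduced, and the excluded ports below are only illustrative RKE2 defaults:

```bash
# Capture on the public interface, excluding Kubernetes/Rancher control-plane
# chatter (6443/9345 are illustrative RKE2 ports), and save it for later analysis.
tcpdump -i enX0 -nn -w calico-debug.pcap 'not port 6443 and not port 9345'

# Read the capture back with checksum details.
tcpdump -nn -vv -r calico-debug.pcap
```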
PCAP file => calico-debug.pcap.zip Thanks for your help :)
More debug information. In the case of the working cluster, the FORWARD chain looks like this:
However, the other one looks "emptier":
Could this be the cause?
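If anyone wants to compare the two clusters the same way, the chain can be dumped with a standard iptables command:

```bash
# Dump the filter-table FORWARD chain with packet counters; on a healthy node the
# kube-proxy and Calico hooks appear as jumps to KUBE-FORWARD and cali-FORWARD.
iptables -L FORWARD -n -v --line-numbers
```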
There seem to be missing kube-proxy chains in the non-working cluster. Can you reach pod-pod via a service? Does your kube-proxy work? You should see each SYN that is being forwarded 2x (perhaps with a NATed source): inbound via enX0 and outbound via vxlan.calico - like you can see with the successful connection to the local pod (outbound via the pod's iface). It does not seem like the packets get forwarded to any pod.
Yes, it should be enX1. However, it does not seem to be an issue for pod-pod over vxlan. If you were tcpdumping on udp port 4789 (vxlan) you would likely see vxlan packets leaving over enX1 with enX2's source IP. They are likely being dropped at the destination node because of the unexpected source IP. Check which IP/device is selected and how the IP autodetection method is configured.
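A couple of ways to check that (commands are illustrative; the Installation resource name assumes an operator-based install):

```bash
# The IPv4 address calico-node selected is recorded as a node annotation.
kubectl describe node <node-name> | grep projectcalico.org/IPv4Address

# With the tigera-operator, the autodetection method lives on the Installation resource.
kubectl get installation default -o yaml | grep -A3 nodeAddressAutodetection
```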
Hi again @tomastigera! I was finally able to make it work!

The interface binding was indeed not correct: it was using enX2 instead of enX1. I modified the Helm chart configuration so that Calico picks the right interface (a sketch of that kind of change is at the end of this comment), and after applying it the binding was fixed.

Regarding the iptables rules issue, I believe there was a moment during the provisioning of the nodes when we accidentally deleted some of the rules. I thought that restarting kube-proxy would recreate them automatically, but it seems that a full resync of the rules doesn't happen unless you either restart the node or delete certain rules (see kubernetes/kubernetes#129128 (comment)). In short, I restarted the nodes and the kube-proxy rules reappeared. After that, everything started working correctly.

I want to sincerely thank you for your help. It was really useful for debugging the issue and for learning a bit more about the inner workings of Calico. You can close the issue :)
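For reference, with a tigera-operator-based install the same effect can be expressed by pinning the IPv4 autodetection to the desired interface. This is only a sketch: the field names follow the operator's Installation resource, and it is not necessarily the exact Helm value that was changed.

```bash
# Force Calico to autodetect its node address (and hence the VXLAN underlay) from enX1.
kubectl patch installation default --type=merge -p '{
  "spec": {
    "calicoNetwork": {
      "nodeAddressAutodetectionV4": {
        "interface": "enX1"
      }
    }
  }
}'
```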
Thanks for reporting back and great that it works for you! 🎉
Expected Behavior
According to the Kubernetes design, NodePort services must be "available" on all nodes of the cluster, regardless of whether the target pods of the service are running on those nodes. When introducing external traffic to the cluster through a load balancer, requests will be distributed among the different nodes. If the node to which a request is directed does not have the required pod, it must be able to redirect the traffic correctly.
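A quick way to sanity-check that expectation from outside the cluster (the node IPs and the NodePort below are placeholders):

```bash
# A NodePort should answer on every node, even on nodes not running a backing pod,
# as long as the Service keeps the default externalTrafficPolicy: Cluster.
for node_ip in 192.0.2.11 192.0.2.12 192.0.2.13; do
  curl -sv --connect-timeout 3 "http://${node_ip}:30080/" -o /dev/null
done
```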
Current Behavior
A service exposed externally only works if the target pod is running on the same node the original request was sent to. In the other cases, a timeout occurs and the workload cannot be accessed.
Possible Solution
Other issues like #9433, #9985 or #8860 suggest disabling checksum verification. Some people confirmed that this workaround was sufficient to resolve their problem. For us, it was not the case. It is possible that our issue comes from another source and is not directly related. However, the symptoms are very similar to those of other recent issues.
Steps to Reproduce (for bugs)
Context
We have a test environment that is practically equivalent to the one that is causing us problems. There are only two main differences:
The procedures performed on both clusters are the same to make the comparison as accurate as possible in order to determine the real source of the problem.
MTU
First of all, there was an MTU configuration problem. Interfaces were not being correctly auto-detected because of the mtuIfacePattern default value. I have changed the configuration to match the Debian 12 interface name pattern (check #10148, and the sketch below) and now everything is working as expected. Here are the logs of one of the calico nodes:
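A minimal sketch of that kind of change, assuming Felix's mtuIfacePattern parameter; the regex below is only illustrative (the exact value we applied is discussed in #10148):

```bash
# Widen the interface-name pattern Felix uses for MTU auto-detection so that the
# Debian 12/Xen enX* names are matched (the pattern shown is deliberately permissive).
kubectl patch felixconfiguration default --type=merge \
  -p '{"spec":{"mtuIfacePattern":"^(en|eth).*"}}'
```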
Checksum verification

As stated in other issues, there is a bug in the kernel and checksum generation for packets is not done correctly, causing problems with its validation. By executing a basic tcpdump with the -vv flag, we can see that errors are present:

As a workaround, some people suggested disabling it. I have done it on all my nodes by executing the following command on the node interfaces (public interface enX0, private interface enX1 for inter-node communication, and vxlan.calico):
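The command itself is not reproduced above; the workaround described in the linked issues is along these lines (the interface list comes from the description, and the exact offload feature names can vary by driver):

```bash
# Disable checksum offloading on the interfaces involved in the VXLAN path.
for iface in enX0 enX1 vxlan.calico; do
  ethtool -K "$iface" tx off rx off
done
```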
However, the problem persists.
Iptables configuration
As we all know, traffic is routed to services by creating rules in iptables. Because of that, it's essential that the service is correctly configured and that all rules are present. Here are the rules created on one of my nodes:
PREROUTING chain
KUBE-SERVICES chain
KUBE-NODEPORTS chain
KUBE-EXT-EDNDUDH2C75GIR6O chain
KUBE-SVC-EDNDUDH2C75GIR6O chain
Specific pod chain (the others are equivalent)
OUTPUT and POSTROUTING
Other output and postrouting related chains
As you can see, everything looks normal. I have compared these rules with the testing environment, which is currently working, and they are the same.
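For completeness, these chains can be dumped (and diffed against the test cluster) with standard iptables commands; the table and chain names match the ones listed above:

```bash
# kube-proxy programs Services in the nat table; dump the relevant chains with counters.
iptables -t nat -L PREROUTING -n -v
iptables -t nat -L KUBE-SERVICES -n -v
iptables -t nat -L KUBE-NODEPORTS -n -v
iptables -t nat -L KUBE-EXT-EDNDUDH2C75GIR6O -n -v
iptables -t nat -L KUBE-SVC-EDNDUDH2C75GIR6O -n -v
iptables -t nat -L POSTROUTING -n -v
```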
IPs and routes of one of the nodes
There are routes for the other nodes across the VXLAN tunnels, so I think this configuration is also correct. The only strange thing is that the `vxlan.calico` interface has the UNKNOWN state. Is this normal?
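For reference, the routes and interface state can be inspected as follows; a vxlan device reporting operstate UNKNOWN is generally normal, since the kernel has no carrier information for it:

```bash
# Routes to remote pod CIDRs should point at vxlan.calico via the remote node's vxlan IP.
ip route show | grep vxlan.calico

# Detailed link info: the "local <ip>" / "dev <iface>" part shows the underlay binding.
ip -d link show vxlan.calico
ip addr show vxlan.calico
```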
Conntrack status
This point is the only one where we have detected a different behavior between the faulty cluster and the test cluster. In the case of the test cluster (the working one), we can see that the NAT translation is done with the IP of the VXLAN interface:
However, in the case of the faulty cluster, the translation is done with the public interface IP:
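Something like the following reproduces those conntrack listings (the NodePort value is a placeholder):

```bash
# List conntrack entries for connections arriving on the NodePort; the reply-direction
# tuple shows which local IP the traffic was NATed with.
conntrack -L -p tcp --dport 30080
```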
Your Environment
If you need any more information about our environment, we are happy to provide it. I hope this context is sufficient to help find the issue with our cluster.
Thanks in advance for your time and help!