[BUG] After a cluster node restart, the nat-gw pod's EIP is lost (not always reproducible) #4987
Comments
I found a similar open issue that might be relevant to your problem. That issue discusses EIP connectivity problems after restarting the kube-ovn network plugin and suggests several workarounds.
To address the specific issue where the vpc-nat-gw pod's EIP is lost after a cluster node restart, make sure the redo EIP logic is actually triggered. This can be achieved by handling the update of the VPC EIP properly.
Warning NodeNotReady 7m28s node-controller Node is not ready
Warning FailedCreatePodSandBox 7m20s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "992a0881ecf83dc1dcd71de5ed135d63c8bb06ae41d62a1b138c1ffee69f1c9b": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": dial unix /run/multus/multus.sock: connect: no such file or directory
Warning FailedCreatePodSandBox 7m6s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4102a18946ec67561023a0fa1a5dfe6bea3adddc52568962de885ebe30dc084c": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": dial unix /run/multus/multus.sock: connect: no such file or directory
Normal SandboxChanged 6m51s (x3 over 7m20s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal AddedInterface 5m21s multus Add eth0 [192.168.210.254/24] from kube-ovn
Normal AddedInterface 4m28s multus Add net1 [172.12.0.16/24] from kube-system/net-96m3optj
Normal Pulled 4m25s kubelet Container image "docker.io/kubeovn/vpc-nat-gateway:v1.12.22" already present on machine
Normal Created 4m25s kubelet Created container vpc-nat-gw-init
Normal Started 4m25s kubelet Started container vpc-nat-gw-init
Normal Pulled 4m25s kubelet Container image "docker.io/kubeovn/vpc-nat-gateway:v1.12.22" already present on machine
Normal Created 4m25s kubelet Created container vpc-nat-gw
Normal Started 4m25s kubelet Started container vpc-nat-gw
Warning FailedAddingInterface 2m21s (x4 over 4m27s) pod-networks-updates pod [kube-system/vpc-nat-gw-eip-z10eqvi9-0]: failed adding interface to network: net-96m3optj
The NodeNotReady, /run/multus/multus.sock: connect:, and failed adding interface to network: net-96m3optj messages here suggest a problem with the node's network configuration; if the node is healthy, the pod can be created. If the pod is not ready, the EIP redo should fail and will keep retrying.
When all cluster nodes are restarted, a nat-gw pod may fail to be recreated properly (only its containers restart; the pod's creationTimestamp does not change). We observed that the nat-gw pod's configuration (previously added EIPs, iptables rules, routes) is completely lost and cannot be recovered, because the redo logic for the other bound resources is never triggered. In fact, manually running rm -f on the "/pause" and "vpc-nat-gw" containers of a healthy nat-gw pod reproduces this situation. In a real environment a pod's containers can be killed and restarted for all kinds of reasons beyond our control, and every such restart loses the nat-gw pod's configuration. Taking this case as an example, users probably do not want an EIP to become unreachable just because cluster nodes were restarted, and then to need a manual rebuild of the nat-gw pod to recover 😢. Could the container's latest start time, pod.status.containerStatuses.state.running.startedAt, be used to decide whether to run the redo logic for the other bound resources?
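A minimal Go sketch of the check proposed above might look like the following. This is not Kube-OVN's actual code; the helper name natGwNeedsRedo and the grace window are assumptions, and a real controller would get the pod from its informer cache.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// natGwNeedsRedo is a hypothetical helper: it returns true when any container
// of the nat-gw pod has been (re)started after the pod object was created,
// i.e. the pod itself was not rebuilt but its containers were.
func natGwNeedsRedo(pod *corev1.Pod, graceWindow time.Duration) bool {
	created := pod.CreationTimestamp.Time
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Running == nil {
			continue
		}
		started := cs.State.Running.StartedAt.Time
		// A non-zero RestartCount, or a start time well after pod creation,
		// indicates an in-place container restart rather than a fresh pod.
		if cs.RestartCount > 0 || started.After(created.Add(graceWindow)) {
			return true
		}
	}
	return false
}

func main() {
	// In a real controller this pod would come from an informer/lister.
	var pod corev1.Pod
	fmt.Println("redo needed:", natGwNeedsRedo(&pod, 2*time.Minute))
}
```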
@kldancer You can give it a try; the original restart logic was based only on the pod's creationTimestamp.
I took a different approach that is still based on the pod's creationTimestamp: when the nat-gw pod's containers restart, it ends up triggering a rebuild of the nat-gw StatefulSet.
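A rough sketch of that direction, assuming that forcing the StatefulSet to be recreated is enough to refresh the pod's creationTimestamp and re-trigger the existing redo path. The function name rebuildNatGwStatefulSet is illustrative only and this is not the code from the eventual PR.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

// rebuildNatGwStatefulSet deletes the StatefulSet backing a vpc-nat-gw pod; the
// controller that owns it is expected to recreate it, so the replacement pod
// gets a fresh creationTimestamp and the creationTimestamp-based redo logic
// for EIPs, iptables rules and routes runs again.
func rebuildNatGwStatefulSet(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	if err := client.AppsV1().StatefulSets(namespace).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("delete StatefulSet %s/%s: %w", namespace, name, err)
	}
	return nil
}

func main() {
	// Wiring with a fake clientset just to show the call shape; a real
	// controller would use its existing client and informers. The name below
	// is taken from the pod events above.
	client := fake.NewSimpleClientset()
	if err := rebuildNatGwStatefulSet(context.Background(), client, "kube-system", "vpc-nat-gw-eip-z10eqvi9"); err != nil {
		fmt.Println("rebuild failed:", err)
	}
}
```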
@kldancer I think your proposal is more reasonable. Could you help by submitting a PR?
Kube-OVN Version
v1.12.22
Kubernetes Version
v1.31.1
Operation-system/Kernel Version
5.10.0-136.12.0.86.4.hl202.x86_64
Description
The problem:
After cluster nodes are restarted, there is a chance that the vpc-nat-gw pod is not properly recreated (only its containers restart). Because the gw pod's creationTimestamp does not change, the redo EIP logic is never triggered, and the EIPs bound to that nat-gw pod are lost.
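To confirm this state on a live cluster, a small diagnostic along these lines may help (assumptions: kubeconfig access, and the namespace/pod name taken from the events above). A pod whose creationTimestamp is old but whose containers show restartCount > 0 and a recent startedAt is exactly the "containers restarted, pod not recreated" case described here.

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Pod name taken from the events above; adjust for your environment.
	pod, err := client.CoreV1().Pods("kube-system").Get(context.Background(), "vpc-nat-gw-eip-z10eqvi9-0", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	fmt.Println("creationTimestamp:", pod.CreationTimestamp)
	for _, cs := range pod.Status.ContainerStatuses {
		start := "<not running>"
		if cs.State.Running != nil {
			start = cs.State.Running.StartedAt.String()
		}
		// restartCount > 0 with an unchanged creationTimestamp is the
		// "containers restarted but pod not recreated" case.
		fmt.Printf("container %s: restartCount=%d startedAt=%s\n", cs.Name, cs.RestartCount, start)
	}
}
```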
How can this be avoided or fixed?
The pod events, kubelet logs, and multus-cni / kube-ovn-cni logs are as follows:
Steps To Reproduce
Current Behavior
After the nat-gw pod's containers restart abnormally (the pod itself is not recreated), the bound EIPs are lost.
Expected Behavior
After the nat-gw pod's containers restart abnormally (the pod itself is not recreated), the existing EIPs are re-bound.