nat-gateway.sh init not exec after k8s cluster reboot #3241

Closed
wenwenxiong opened this issue Sep 20, 2023 · 18 comments · Fixed by #3254
Labels: bug

@wenwenxiong
Contributor

Expected Behavior

After every node in the k8s cluster is rebooted, the vpc-nat-gateway pod should run normally with its NAT rules applied again.

Actual Behavior

After every node in the k8s cluster is rebooted, all iptables rules in the vpc-nat-gateway pod are gone.

Steps to Reproduce the Problem

  1. Run k8s with kube-ovn 1.12, create a vpc-nat-gateway pod, and add FIP and other iptables rules.
  2. Reboot every node in the cluster.
  3. The vpc-nat-gateway pod comes back up automatically, but all of its iptables rules are gone.

Additional Info

kube-ovn-controller logs:

I0920 17:37:38.452561 1 vpc_dns.go:520] the vpc-dns configuration is not set
I0920 17:37:38.457852 1 node.go:752] start to check gateway status
I0920 17:37:41.604881 1 vpc_nat_gateway.go:603] handle update subnet route for nat gateway gw1
I0920 17:37:41.605216 1 vpc_nat_gateway.go:719] bash /kube-ovn/nat-gateway.sh ext-subnet-route-add 172.56.0.0/24,172.56.0.1
I0920 17:37:41.683340 1 vpc_nat_gateway.go:727] failed to ExecuteCommandInContainer, stdOutput: ext-subnet-route-add 172.56.0.0/24,172.56.0.1
nat gateway not inited
E0920 17:37:41.683393 1 vpc_nat_gateway.go:729] command terminated with exit code 1
E0920 17:37:41.683414 1 vpc_nat_gateway.go:630] failed to exec nat gateway rule, err: command terminated with exit code 1
E0920 17:37:41.683654 1 vpc_nat_gateway.go:197] process: updateVpcSubnet. err: error syncing 'gw1': failed to exec nat gateway rule, err: command terminated with exit code 1, requeuing
I0920 17:37:42.361659 1 network_attachment.go:66] parsePodNetworkAnnotation: [{"interface":"podefb02104595","name":"net1-3","namespace":"default"}], default
I0920 17:37:42.368780 1 network_attachment.go:66] parsePodNetworkAnnotation: [{"interface":"pod774883e771f","name":"net1","namespace":"default"}], default
I0920 17:37:42.379176 1 network_attachment.go:66] parsePodNetworkAnnotation: kube-system/ovn-vpc-external-network, kube-system
I0920 17:37:42.379197 1 network_attachment.go:21] parsePodNetworkObjectName: kube-system/ovn-vpc-external-network
I0920 17:37:43.453296 1 vpc_dns.go:520] the vpc-dns configuration is not set
I0920 17:37:43.458556 1 node.go:752] start to check gateway status
I0920 17:37:48.453692 1 vpc_dns.go:520] the vpc-dns configuration is not set
I0920 17:37:48.459161 1 node.go:752] start to check gateway status
I0920 17:37:53.454383 1 vpc_dns.go:520] the vpc-dns configuration is not set
I0920 17:37:53.459781 1 node.go:752] start to check gateway status
I0920 17:37:58.142309 1 provider-network.go:16] start to sync ProviderNetwork status
I0920 17:37:58.454521 1 vpc_dns.go:520] the vpc-dns configuration is not set
I0920 17:37:58.460799 1 node.go:752] start to check gateway status
I0920 17:37:59.095805 1 node.go:942] start to check node port-group status
I0920 17:37:59.096066 1 network_attachment.go:66] parsePodNetworkAnnotation: [{"interface":"pod774883e771f","name":"net1","namespace":"default"}], default
I0920 17:37:59.106064 1 network_attachment.go:66] parsePodNetworkAnnotation: [{"interface":"podefb02104595","name":"net1-3","namespace":"default"}], default
I0920 17:38:01.308490 1 node.go:68] enqueue update node master2
I0920 17:38:01.308673 1 node.go:594] handle update node master2
I0920 17:38:01.684658 1 vpc_nat_gateway.go:603] handle update subnet route for nat gateway gw1
I0920 17:38:01.684983 1 vpc_nat_gateway.go:719] bash /kube-ovn/nat-gateway.sh ext-subnet-route-add 172.56.0.0/24,172.56.0.1
I0920 17:38:01.773061 1 vpc_nat_gateway.go:727] failed to ExecuteCommandInContainer, stdOutput: ext-subnet-route-add 172.56.0.0/24,172.56.0.1
nat gateway not inited
E0920 17:38:01.773185 1 vpc_nat_gateway.go:729] command terminated with exit code 1
E0920 17:38:01.773226 1 vpc_nat_gateway.go:630] failed to exec nat gateway rule, err: command terminated with exit code 1
E0920 17:38:01.773370 1 vpc_nat_gateway.go:197] process: updateVpcSubnet. err: error syncing 'gw1': failed to exec nat gateway rule, err: command terminated with exit code 1, requeuing
I0920 17:38:02.389130 1 network_attachment.go:66] parsePodNetworkAnnotation: [{"interface":"podefb02104595","name":"net1-3","namespace":"default"}], default
I0920 17:38:02.397445 1 network_attachment.go:66] parsePodNetworkAnnotation: [{"interface":"pod774883e771f","name":"net1","namespace":"default"}], default
I0920 17:38:02.407975 1 network_attachment.go:66] parsePodNetworkAnnotation: kube-system/ovn-vpc-external-network, kube-system
I0920 17:38:02.407994 1 network_attachment.go:21] parsePodNetworkObjectName: kube-system/ovn-vpc-external-network
I0920 17:38:03.455243 1 vpc_dns.go:520] the vpc-dns configuration is not set
I0920 17:38:03.461586 1 node.go:752] start to check gateway status
I0920 17:38:05.880274 1 node.go:68] enqueue update node master1
I0920 17:38:05.880326 1 node.go:594] handle update node master1
I0920 17:38:06.776480 1 node.go:68] enqueue update node master3
I0920 17:38:06.776616 1 node.go:594] handle update node master3
I0920 17:38:08.456118 1 vpc_dns.go:520] the vpc-dns configuration is not set
I0920 17:38:08.462390 1 node.go:752] start to check gateway status
I0920 17:38:13.456578 1 vpc_dns.go:520] the vpc-dns configuration is not set
I0920 17:38:13.462839 1 node.go:752] start to check gateway status
I0920 17:38:18.457453 1 vpc_dns.go:520] the vpc-dns configuration is not set
I0920 17:38:18.463780 1 node.go:752] start to check gateway status
I0920 17:38:21.774577 1 vpc_nat_gateway.go:603] handle update subnet route for nat gateway gw1

@wenwenxiong
Contributor Author

wenwenxiong commented Sep 20, 2023

k8s 1.24.8 + kube-ovn 1.12.0 + Ubuntu 18.04, 3-node cluster

@zbb88888
Collaborator

As the log shows:

failed to ExecuteCommandInContainer, stdOutput: ext-subnet-route-add 172.56.0.0/24,172.56.0.1
nat gateway not inited

The NAT gateway has not been initialized, so there are no rules in the NAT gateway pod.

@zbb88888
Collaborator

I don't know why adding the route failed in your environment:

bash /kube-ovn/nat-gateway.sh ext-subnet-route-add 172.56.0.0/24,172.56.0.1

I think you should run the command manually inside the pod to check.

@wenwenxiong
Contributor Author

The command runs fine in the vpc-nat-gateway pod when kube-ovn-controller handles the pod's create event.
But when I reboot every node in the k8s cluster, it looks like the vpc-nat-gateway pod is created before the kube-ovn-controller pod has started, so kube-ovn-controller never handles the create event for the vpc-nat-gateway pod and the /kube-ovn/nat-gateway.sh init command is never executed. I think so because when I delete the vpc-nat-gateway pod manually, the recreated pod runs normally with its iptables rules.

@zbb88888
Collaborator

The NAT gateway init should run as part of the pod's own startup, which should fix this issue.

Remove the init call from the code and run it in the pod's startup command instead.
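
The suggestion above could look roughly like the following Go sketch, which builds the argument list for a container whose command is bash. It is only an illustration and not the change that was merged for this issue: startupArgs and its keepAlive parameter are hypothetical names, and the only detail taken from this thread is the /kube-ovn/nat-gateway.sh init command.

```go
package main

import "fmt"

// initScript is the script path quoted in this thread.
const initScript = "/kube-ovn/nat-gateway.sh"

// startupArgs builds the arguments for a container whose command is "bash".
// The shell runs the gateway init first and hands over to keepAlive only if
// init succeeded. keepAlive is a hypothetical parameter: whatever
// long-running command keeps the gateway pod alive.
// If init fails, the shell exits non-zero, so the container stops instead of
// staying up without any rules.
func startupArgs(keepAlive string) []string {
	script := fmt.Sprintf("bash %s init && exec %s", initScript, keepAlive)
	return []string{"-c", script}
}

func main() {
	// "sleep infinity" is an assumption used only for this demo.
	fmt.Println(startupArgs("sleep infinity"))
}
```

Because init is now tied to the container start, it no longer depends on kube-ovn-controller handling the pod's create event, which is the ordering problem described above.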

@zbb88888
Collaborator

@wenwenxiong Hi, what do you think about this? Do you have time to try it?

zbb88888 self-assigned this Sep 21, 2023
zbb88888 added the bug label Sep 21, 2023
@wenwenxiong
Contributor Author

wenwenxiong commented Sep 22, 2023

I added the /kube-ovn/nat-gateway.sh init command to an init container in the pod.

After rebooting the whole k8s cluster, I found that

bash /kube-ovn/nat-gateway.sh ext-subnet-route-add 172.56.0.0/24,172.56.0.1

was executed successfully, but the kube-ovn-controller logs show no commands like

 bash /kube-ovn/nat-gateway.sh eip-add ...
 bash /kube-ovn/nat-gateway.sh floating-ip-add ...

being run after the reboot.

So the vpc-nat-gateway pod is missing its EIP and FIP iptables rules.

Is this a separate problem?

@zbb88888
Collaborator

zbb88888 commented Sep 22, 2023

Maybe kube-ovn-controller, when it starts, should check whether the NAT gateway pod has been recreated and, if so, trigger re-creation of all its NAT rules.

Each NAT rule's creation time should be later than the pod's creationTime; if it is not, the rule should be recreated.

@wenwenxiong
Contributor Author

How can this be resolved?

@zbb88888
Collaborator

After kube-ovn-controller restarts, it should make sure that all NAT rules were created after the NAT gateway pod was created. If they were not, it should trigger re-creation of all NAT rules belonging to that NAT gateway pod.
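
The time comparison described above could be sketched like this in Go. This is a sketch under assumptions, not the code from any PR: natPredatesPod and staleNats are hypothetical helpers, the rule names are made up, and the map of recorded rule times stands in for whatever bookkeeping the controller would really use. The only firm detail is that Kubernetes reports metadata.creationTimestamp in RFC 3339 format.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// natPredatesPod reports whether a NAT rule's recorded time is earlier than
// the gateway pod's creation time. Both arguments are RFC 3339 timestamps.
func natPredatesPod(natTime, podCreated string) (bool, error) {
	nat, err := time.Parse(time.RFC3339, natTime)
	if err != nil {
		return false, fmt.Errorf("parse nat time: %w", err)
	}
	pod, err := time.Parse(time.RFC3339, podCreated)
	if err != nil {
		return false, fmt.Errorf("parse pod creation time: %w", err)
	}
	return nat.Before(pod), nil
}

// staleNats returns the sorted names of NAT rules that were recorded before
// the pod was created and therefore should be re-created.
// natTimes maps a rule name to its recorded time (hypothetical bookkeeping).
func staleNats(natTimes map[string]string, podCreated string) ([]string, error) {
	var stale []string
	for name, t := range natTimes {
		old, err := natPredatesPod(t, podCreated)
		if err != nil {
			return nil, fmt.Errorf("rule %s: %w", name, err)
		}
		if old {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale, nil
}

func main() {
	// Illustrative values only; none of these come from the thread.
	rules := map[string]string{
		"eip-a": "2023-09-20T09:00:00Z", // recorded before the pod existed
		"fip-b": "2023-09-20T10:00:00Z", // recorded after the pod was created
	}
	stale, err := staleNats(rules, "2023-09-20T09:30:00Z")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("rules to re-create:", stale)
}
```

A rule whose time equals the pod's creation time is treated as up to date here; whether that edge case should trigger re-creation is a design choice this sketch does not settle.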

@wenwenxiong
Contributor Author

That seems hard for me to do.

@wenwenxiong
Contributor Author

Can you fix it?

@zbb88888
Collaborator

I will try later.

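@zbb88888
Collaborator

Moving the route setup into init is an improvement for now, but it needs a follow-up change: kube-ovn-controller has to check whether an iptables NAT rule's time is earlier than the NAT gateway pod's creation time, and any such NAT rule needs to be re-created.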

@zbb88888
Collaborator

@wenwenxiong Hi, I have a PR that may cover this issue (NAT rules not recovered), but I don't really know how you test this. Could you help confirm it?

The PR is: https://github.com/kubeovn/kube-ovn/pull/3261/files

@wenwenxiong
Contributor Author

wenwenxiong commented Sep 27, 2023

I found that this issue does not occur on Ubuntu 22.04. It looks like a difference in iptables versions between the host OS and the vpc-nat-gateway container causes it.
On Ubuntu 18.04, with iptables 1.6.1 on the host (the vpc-nat-gateway pod has iptables 1.8.9), it throws errors like this:

vpc_nat_gateway.go:731] failed to ExecuteCommandInContainer errOutput: Warning: Extension DNAT is not supported, missing kernel module?
E0927 12:45:54.228902       1 vpc_nat_gw_eip.go:255] failed to create eip 'eipf03' in pod, Warning: Extension DNAT is not supported, missing kernel module?
E0927 12:45:54.228941       1 vpc_nat_gw_eip.go:111] error syncing 'eipf03': Warning: Extension DNAT is not supported, missing kernel module?, requeuing

On Ubuntu 22.04, with iptables 1.8.7 on the host (the vpc-nat-gateway pod has iptables 1.8.9), it works normally.

Some related Docker and Kubernetes issues:
robbertkl/docker-ipv6nat#47
kubernetes/kubernetes#94754
kubernetes/kubernetes#95409
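
To compare the two iptables builds mentioned above, the first line of iptables --version can be parsed on the host and inside the pod. The Go sketch below is a rough illustration: parseIptablesVersion and backendsDiffer are hypothetical helpers, and the idea that the failure comes from a legacy versus nf_tables backend mismatch is an assumption that this thread does not confirm. iptables 1.8 and later print the backend in parentheses; older releases such as 1.6.1 print only the version.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionLine matches the first line printed by `iptables --version`,
// e.g. "iptables v1.8.9 (nf_tables)" or "iptables v1.6.1".
var versionLine = regexp.MustCompile(`^iptables v(\d+\.\d+(?:\.\d+)?)(?: \((legacy|nf_tables)\))?`)

// parseIptablesVersion extracts the version and, when printed, the backend.
// backend is "" for releases that do not print one (before 1.8).
func parseIptablesVersion(out string) (version, backend string, ok bool) {
	m := versionLine.FindStringSubmatch(out)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

// effectiveBackend treats a missing backend as "legacy".
// Assumption: a pre-1.8 `iptables` binary uses the legacy backend.
func effectiveBackend(backend string) string {
	if backend == "" {
		return "legacy"
	}
	return backend
}

// backendsDiffer reports whether the host and the pod appear to use
// different iptables backends, based on their `iptables --version` output.
func backendsDiffer(hostOut, podOut string) (bool, error) {
	_, hostBackend, ok := parseIptablesVersion(hostOut)
	if !ok {
		return false, fmt.Errorf("unrecognized host output: %q", hostOut)
	}
	_, podBackend, ok := parseIptablesVersion(podOut)
	if !ok {
		return false, fmt.Errorf("unrecognized pod output: %q", podOut)
	}
	return effectiveBackend(hostBackend) != effectiveBackend(podBackend), nil
}

func main() {
	// Sample strings are illustrative; the thread gives version numbers only,
	// not the backend each build reports.
	differ, err := backendsDiffer("iptables v1.6.1", "iptables v1.8.9 (nf_tables)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("backends differ:", differ)
}
```

The real input strings would come from running iptables --version on a node and inside the vpc-nat-gateway pod.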

@wenwenxiong
Contributor Author

The PR (https://github.com/kubeovn/kube-ovn/pull/3261/files) seems to work on Ubuntu 22.04,
but on Ubuntu 18.04 it fails with an error like:

failed to ExecuteCommandInContainer errOutput: Warning: Extension DNAT is not supported, missing kernel module?

@zbb88888
Collaborator

Thanks for your help
