
IP resources are not reclaimed; stale subnet IP allocations remain #2125

Closed
syang1997 opened this issue Dec 5, 2022 · 25 comments · Fixed by #2143

Comments

@syang1997

Expected Behavior

IP resources are not reclaimed; stale subnet IP allocations remain.

Actual Behavior

Steps to Reproduce the Problem

apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: subnet-cdq57ea8j5gqg4vf8ak0
spec:
  cidrBlock: 168.50.8.0/24
  default: false
  excludeIps:
  - 168.50.8.254
  gateway: 168.50.8.254
  gatewayNode: ""
  gatewayType: distributed
  natOutgoing: false
  private: false
  protocol: IPv4
  provider: ovn
  vpc: vpc-cdq56t28j5gqg4vf8ajg
NAME                          PROVIDER   VPC                        PROTOCOL   CIDR            PRIVATE   NAT     DEFAULT   GATEWAYTYPE   V4USED   V4AVAILABLE   V6USED   V6AVAILABLE   EXCLUDEIPS
subnet-cdq57ea8j5gqg4vf8ak0   ovn        vpc-cdq56t28j5gqg4vf8ajg   IPv4       168.50.8.0/24   false     false   false     distributed   11       242           0        0             ["168.50.8.254"]
[root@iaas-cms-ctrl-1 ~]# k get ip | grep 168.50.8.
vm-ce3fr4q8j5gh613m5u50.yiaas.net1.yiaas.ovn                                                       168.50.8.2               00:00:00:A9:2E:06   iaas-cms-ctrl-1   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce3vprq8j5ggeis9ivig.yiaas.net1.yiaas.ovn                                                       168.50.8.1               00:00:00:BF:18:12   iaas-cms-ctrl-1   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce418ki8j5ggeis9ivmg.yiaas.net1.yiaas.ovn                                                       168.50.8.2               00:00:00:21:DA:C3   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce41etq8j5ggeis9ivo0.yiaas.net1.yiaas.ovn                                                       168.50.8.3               00:00:00:52:22:AC   iaas-cms-ctrl-1   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce41hri8j5ggeis9ivqg.yiaas.net1.yiaas.ovn                                                       168.50.8.4               00:00:00:5A:55:C0   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce41kta8j5ggeis9ivs0.yiaas.net1.yiaas.ovn                                                       168.50.8.5               00:00:00:9A:39:09   iaas-cms-ctrl-1   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce441k28j5ggeis9ivug.yiaas.net1.yiaas.ovn                                                       168.50.8.6               00:00:00:E9:33:BB   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce4l38i8j5ggeis9j050.yiaas.net1.yiaas.ovn                                                       168.50.8.7               00:00:00:FC:DB:BC   iaas-cms-ctrl-1   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce4qfgq8j5ggeis9j070.yiaas.net1.yiaas.ovn                                                       168.50.8.8               00:00:00:EF:C2:10   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce4s0hq8j5ggeis9j0hg.yiaas.net1.yiaas.ovn                                                       168.50.8.9               00:00:00:81:B5:1B   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce4s9qq8j5ggeis9j1lg.yiaas.net1.yiaas.ovn                                                       168.50.8.10              00:00:00:B5:37:26   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.7", GitCommit:"42c05a547468804b2053ecf60a3bd15560362fc2", GitTreeState:"clean", BuildDate:"2022-05-24T12:30:55Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}


- kube-ovn version:

v1.10.7


- operation-system/kernel version:

**Output of `awk -F '=' '/PRETTY_NAME/ { print $2 }' /etc/os-release`:**
**Output of `uname -r`:**

CentOS Stream 8 5.4.223-1.el8.elrepo.x86_64


@zbb88888 (Collaborator) commented Dec 5, 2022

There are IP CRD records left over from VMs that have already been deleted, and duplicate VM IP CRDs exist.

@oilbeater (Collaborator)

Possibly related to #2087; update to that patch and check again.

@hongzhen-ma (Collaborator)

For a VM with keep-vm-ip=true, if the VM pod is deleted while in the Running state, the IP CRD is deleted together with the vm pod. If the VM is deleted while in the Stopped state, the IP CRD has to wait for GC, roughly 12 minutes: a stopped VM may be started again, so its IP CRD is kept until the VM itself is deleted.

@zbb88888 (Collaborator) commented Dec 5, 2022

> For a VM with keep-vm-ip=true, if the VM pod is deleted while in the Running state, the IP CRD is deleted together with the vm pod. If the VM is deleted while in the Stopped state, the IP CRD has to wait for GC, roughly 12 minutes: a stopped VM may be started again, so its IP CRD is kept until the VM itself is deleted.

This duplicate IP CRD has definitely been around for far longer than 12 minutes.

@zbb88888 (Collaborator) commented Dec 5, 2022

[centos@iaas-cms-ctrl-1 ~]$ grep "Starting OVN controller" -r /var/log/kube-ovn/
/var/log/kube-ovn/kube-ovn-controller.log:I1205 14:12:20.809618       7 controller.go:461] Starting OVN controller
[centos@iaas-cms-ctrl-1 ~]$
[centos@iaas-cms-ctrl-1 ~]$ ssh iaas-cms-ctrl-2
[centos@iaas-cms-ctrl-2 ~]$ grep "Starting OVN controller" -r /var/log/kube-ovn/
[centos@iaas-cms-ctrl-2 ~]$ logout
Connection to iaas-cms-ctrl-2 closed.
[centos@iaas-cms-ctrl-1 ~]$ ssh iaas-cms-ctrl-3
[centos@iaas-cms-ctrl-3 ~]$ grep "Starting OVN controller" -r /var/log/kube-ovn/

There is no recent crash log for kube-ovn-controller, so it does not appear to be crashing and restarting repeatedly.

@zbb88888 (Collaborator) commented Dec 5, 2022

[root@iaas-cms-ctrl-1 ovn]# k get ip vm-ce3fr4q8j5gh613m5u50.yiaas.net1.yiaas.ovn -o yaml
apiVersion: kubeovn.io/v1
kind: IP
metadata:
  creationTimestamp: "2022-11-30T06:53:32Z"
  generation: 3
  labels:
    ovn.kubernetes.io/subnet: subnet-cdq57ea8j5gqg4vf8ak0
    subnet-cdq57ea8j5gqg4vf8ak0: ""
  name: vm-ce3fr4q8j5gh613m5u50.yiaas.net1.yiaas.ovn
  resourceVersion: "17229232"
  uid: 92614203-af24-4327-ae14-edbc9a41c771
spec:
  attachIps: []
  attachMacs: []
  attachSubnets: []
  containerID: ""
  ipAddress: 168.50.8.2
  macAddress: 00:00:00:A9:2E:06
  namespace: yiaas
  nodeName: iaas-cms-ctrl-1
  podName: vm-ce3fr4q8j5gh613m5u50
  podType: VirtualMachine
  subnet: subnet-cdq57ea8j5gqg4vf8ak0 # subnet id is inconsistent
  v4IpAddress: 168.50.8.2
  v6IpAddress: ""
[root@iaas-cms-ctrl-1 ovn]# k get ip vm-ce418ki8j5ggeis9ivmg.yiaas.net1.yiaas.ovn -o yaml
apiVersion: kubeovn.io/v1
kind: IP
metadata:
  creationTimestamp: "2022-12-01T02:42:52Z"
  generation: 4
  labels:
    ovn.kubernetes.io/subnet: subnet-cdq57ea8j5gqg4vf8ak0
    subnet-cdq57ea8j5gqg4vf8ak0: ""
  name: vm-ce418ki8j5ggeis9ivmg.yiaas.net1.yiaas.ovn
  resourceVersion: "18109557"
  uid: 9ea4b4f9-6317-4a8a-ba1e-3afafa77db48
spec:
  attachIps: []
  attachMacs: []
  attachSubnets: []
  containerID: ""
  ipAddress: 168.50.8.2
  macAddress: 00:00:00:21:DA:C3
  namespace: yiaas
  nodeName: iaas-cms-ctrl-2
  podName: vm-ce418ki8j5ggeis9ivmg
  podType: VirtualMachine
  subnet: subnet-cdq57ea8j5gqg4vf8ak0  # subnet id is inconsistent
  v4IpAddress: 168.50.8.2
  v6IpAddress: ""
[root@iaas-cms-ctrl-1 ovn]#


# this record has been kept for a very long time


# these two IPs belong to different subnets, which is why the address conflicts; the root cause should be a subnet conflict
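The leak above can be spotted mechanically by scanning the IP CRDs for addresses claimed by more than one record. A minimal sketch (not kube-ovn code; the dict layout simply mirrors the `k get ip -o yaml` output shown above):

```python
from collections import defaultdict

def find_duplicate_ips(ip_records):
    """Group IP CRD records by their v4 address and return the
    addresses claimed by more than one record (a leaked allocation)."""
    by_addr = defaultdict(list)
    for rec in ip_records:
        by_addr[rec["spec"]["v4IpAddress"]].append(rec["metadata"]["name"])
    return {addr: names for addr, names in by_addr.items() if len(names) > 1}

# The two records from the transcript above, reduced to the relevant fields:
records = [
    {"metadata": {"name": "vm-ce3fr4q8j5gh613m5u50.yiaas.net1.yiaas.ovn"},
     "spec": {"v4IpAddress": "168.50.8.2"}},
    {"metadata": {"name": "vm-ce418ki8j5ggeis9ivmg.yiaas.net1.yiaas.ovn"},
     "spec": {"v4IpAddress": "168.50.8.2"}},
]
print(find_duplicate_ips(records))
```

Feeding it the full `k get ip -o json` output (e.g. via `kubectl ... | jq`) would flag every duplicated address in one pass.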

@zbb88888 (Collaborator) commented Dec 5, 2022

(screenshot)

The webhook does not block creating a subnet with the same CIDR, yet it does block it at deletion time.

@oilbeater (Collaborator)

If the k8s audit log is enabled, you can check the operation records for this IP resource to see whether it was deleted and then created again.

@hongzhen-ma (Collaborator)

> (screenshot)
> The webhook does not block creating a subnet with the same CIDR, yet it does block it at deletion time.

There are other webhook issues as well; let's list them together. I will double-check this subnet validation.

@zbb88888 (Collaborator) commented Dec 5, 2022

> (screenshot)
> The webhook does not block creating a subnet with the same CIDR, yet it does block it at deletion time.

> There are other webhook issues as well; let's list them together. I will double-check this subnet validation.

To summarize the problems we have hit:

  1. A subnet cidr configured with only an IP and no mask crashes kube-ovn-controller outright.
  2. Within the same VPC, a create or an update can produce two subnets with the same CIDR.
  3. There should be a limit on subnet exclude-ips, so a subnet cannot end up with no usable IP at all.
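All three checks could live in the subnet admission webhook. A hypothetical sketch of that validation using only the Python standard library (`validate_subnet` is an illustration, not kube-ovn's actual webhook logic):

```python
import ipaddress

def validate_subnet(cidr, existing_cidrs, exclude_ips):
    """Hypothetical webhook-style validation for the three problems
    listed above: a mask-less CIDR, an overlapping CIDR in the same
    VPC, and an exclude list that leaves no usable address."""
    if "/" not in cidr:
        return "cidr has no mask"            # problem 1: would crash the controller
    net = ipaddress.ip_network(cidr, strict=False)
    for other in existing_cidrs:
        if net.overlaps(ipaddress.ip_network(other, strict=False)):
            return f"cidr overlaps {other}"  # problem 2: duplicate/overlapping subnet
    # problem 3: network + broadcast are unusable; excluded IPs shrink the pool further
    usable = net.num_addresses - 2 - len(set(exclude_ips))
    if usable <= 0:
        return "exclude-ips leaves no usable IP"
    return "ok"

print(validate_subnet("168.50.8.0", [], []))          # rejected: mask missing
print(validate_subnet("168.50.8.0/24",
                      ["168.50.9.0/24"],
                      ["168.50.8.254"]))              # passes
```

The overlap check would have to run against all subnets in the same VPC, on update as well as create, to close the second gap.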

@zbb88888 (Collaborator) commented Dec 5, 2022

> If the k8s audit log is enabled, you can check the operation records for this IP resource to see whether it was deleted and then created again.

These two duplicate IP records were created more than 20 hours apart and do not belong to the same subnet, so this should not be a delete-then-recreate triggered by the same pod.

Audit logging is on our roadmap but not available yet.

@zbb88888 (Collaborator) commented Dec 6, 2022

# on the master branch, VM IPs are not cleaned up either
[root@iaas-cms-ctrl-1 ~]# k get ip  | grep  168.0.0
vm-ce6s14q8j5gjlb83p58g.yiaas.net1.yiaas.ovn                                             168.0.0.2             00:00:00:A7:B8:D2   iaas-cms-ctrl-1   subnet-ce6rhna8j5gjlb83p4fg
vm-ce6tl1a8j5gjlb83p5e0.yiaas.net1.yiaas.ovn                                             168.0.0.3             00:00:00:A1:E1:70   iaas-cms-ctrl-2   subnet-ce6rhna8j5gjlb83p4fg
vm-ce7am7q8j5gjlb83p5lg.yiaas.net1.yiaas.ovn                                             168.0.0.4             00:00:00:AC:78:8F   iaas-cms-ctrl-2   subnet-ce6rhna8j5gjlb83p4fg
vpc-nat-gw-ngw-ce7a22a8j5gjlb83p5gg-0.kube-system                                        168.0.0.253           00:00:00:1E:D3:9B   iaas-cms-ctrl-3   subnet-ce6rhna8j5gjlb83p4fg
[root@iaas-cms-ctrl-1 ~]#



# still present 35 minutes later


@oilbeater (Collaborator)

Is every deleted VM leaving a stale IP, or is it only part of a batch create/delete that is not cleaned up?

@zbb88888 (Collaborator) commented Dec 6, 2022

> Is every deleted VM leaving a stale IP, or is it only part of a batch create/delete that is not cleaned up?

[root@iaas-cms-ctrl-1 ~]# k get ip  | grep  168.0.0
vm-ce6s14q8j5gjlb83p58g.yiaas.net1.yiaas.ovn                                             168.0.0.2             00:00:00:A7:B8:D2   iaas-cms-ctrl-1   subnet-ce6rhna8j5gjlb83p4fg
vm-ce6tl1a8j5gjlb83p5e0.yiaas.net1.yiaas.ovn                                             168.0.0.3             00:00:00:A1:E1:70   iaas-cms-ctrl-2   subnet-ce6rhna8j5gjlb83p4fg
vm-ce7am7q8j5gjlb83p5lg.yiaas.net1.yiaas.ovn                                             168.0.0.4             00:00:00:AC:78:8F   iaas-cms-ctrl-2   subnet-ce6rhna8j5gjlb83p4fg
vpc-nat-gw-ngw-ce7a22a8j5gjlb83p5gg-0.kube-system                                        168.0.0.253           00:00:00:1E:D3:9B   iaas-cms-ctrl-3   subnet-ce6rhna8j5gjlb83p4fg
[root@iaas-cms-ctrl-1 ~]#
[root@iaas-cms-ctrl-1 ~]# k get po -A -o wide | grep 168.0.0
kube-system            vpc-nat-gw-ngw-ce7a22a8j5gjlb83p5gg-0              1/1     Running            0                3h53m   168.0.0.253    iaas-cms-ctrl-3   <none>           <none>
[root@iaas-cms-ctrl-1 ~]#

# judging from the current cluster state, the IPs of all deleted VMs were left behind; the VMs in these tests were created one at a time

@oilbeater (Collaborator)

How were these VMs created? On 1.9 we create VMs with the VirtualMachine resource, and deleting that resource reclaims the IP normally.

@syang1997 (Author)

> How were these VMs created? On 1.9 we create VMs with the VirtualMachine resource, and deleting that resource reclaims the IP normally.

The VM definition:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-ce7am7q8j5gjlb83p5lg
  namespace: iaas
spec:
  dataVolumeTemplates:
  - apiVersion: cdi.kubevirt.io/v1beta1
    kind: DataVolume
    metadata:
      annotations:
        cdi.kubevirt.io/cloneStrategyOverride: copy
      name: vol-ce7am7q8j5gjlb83p5k0
      namespace: yiaas
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 25Gi
        storageClassName: rbd.csi.ssd
        volumeMode: Block
      source:
        pvc:
          name: img-ce3faji8j5gh613m5tkg
          namespace: yiaas
  - apiVersion: cdi.kubevirt.io/v1beta1
    kind: DataVolume
    metadata:
      name: vol-ce7am7q8j5gjlb83p5kg
      namespace: yiaas
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 25Gi
        storageClassName: rbd.csi.ssd
        volumeMode: Block
      source:
        blank: {}
  instancetype:
    kind: VirtualMachineInstancetype
    name: small
    revisionName: vm-ce7am7q8j5gjlb83p5lg-small-a8dee6d1-f20d-4cf2-bf1a-2fb148bd05e5-1
  running: true
  template:
    metadata:
      annotations:
        kubevirt.io/hide-pod-network: "true"
        net1.virtualmachine.fields.yiaas.yealink.com/network: vpc-ce6rhj28j5gjlb83p4f0
        net1.yiaas.ovn.kubernetes.io/allow_live_migration: "true"
        net1.yiaas.ovn.kubernetes.io/logical_switch: subnet-ce6rhna8j5gjlb83p4fg
      creationTimestamp: null
    spec:
      accessCredentials:
      - sshPublicKey:
          propagationMethod:
            configDrive: {}
          source:
            secret:
              secretName: ac-cdqot2a8j5gqg4vf8bt0
      dnsConfig:
        nameservers:
        - 168.0.0.0
      dnsPolicy: ClusterFirst
      domain:
        devices:
          disks:
          - disk: {}
            name: vol-ce7am7q8j5gjlb83p5k0
          - disk: {}
            name: vol-ce7am7q8j5gjlb83p5kg
          - disk: {}
            name: cdi-ce7am7q8j5gjlb83p5l0
          interfaces:
          - bridge: {}
            name: wk
        machine:
          type: q35
        resources: {}
      networks:
      - multus:
          networkName: net1
        name: wk
      volumes:
      - dataVolume:
          name: vol-ce7am7q8j5gjlb83p5k0
        name: vol-ce7am7q8j5gjlb83p5k0
      - dataVolume:
          name: vol-ce7am7q8j5gjlb83p5kg
        name: vol-ce7am7q8j5gjlb83p5kg
      - cloudInitConfigDrive:
          userData: |
            #cloud-config
            ssh_pwauth: True
            groups:
              - admingroup: [root,sys]
            users:
              - name: root
                gecos: Foo B. Bar
                sudo: ALL=(ALL) NOPASSWD:ALL
                groups: root
                expiredate: '2032-09-01'
                lock_passwd: false
                plain_text_passwd: 123456
        name: cdi-ce7am7q8j5gjlb83p5l0

@syang1997 (Author)

> For a VM with keep-vm-ip=true, if the VM pod is deleted while in the Running state, the IP CRD is deleted together with the vm pod. If the VM is deleted while in the Stopped state, the IP CRD has to wait for GC, roughly 12 minutes: a stopped VM may be started again, so its IP CRD is kept until the VM itself is deleted.

Right now, when a VM is stopped and then started again immediately, it appears to lose its IP address; presumably the IP is reclaimed at shutdown and a new address, different from the original one, is allocated on restart.

@hongzhen-ma (Collaborator)

1. A subnet cidr configured with only an IP and no mask crashes kube-ovn-controller:
I updated the webhook to check subnet.spec.cidr, but even with the original image kube-ovn-controller does not crash; there should just be an error like this in the kube-ovn-controller log:
(screenshot)

2. Within the same VPC, a create or an update can produce two subnets with the same CIDR:
As far as I can tell the webhook already validates CIDR conflicts on both subnet create and update, and in my tests I could not get a conflicting update to succeed.
Creating a conflicting subnet:
(screenshot)
Updating a subnet into a CIDR conflict:
(screenshot)

3. Excessive exclude-ips leaving no usable IP:
This feels like an operator error; we will not add validation for now.

@hongzhen-ma (Collaborator)

> For a VM with keep-vm-ip=true, if the VM pod is deleted while in the Running state, the IP CRD is deleted together with the vm pod. If the VM is deleted while in the Stopped state, the IP CRD has to wait for GC, roughly 12 minutes: a stopped VM may be started again, so its IP CRD is kept until the VM itself is deleted.

> Right now, when a VM is stopped and then started again immediately, it appears to lose its IP address; presumably the IP is reclaimed at shutdown and a new address, different from the original one, is allocated on restart.

For this symptom, please confirm whether the keep-vm-ip option is enabled: only when it is disabled does stopping a VM delete the IP CRD directly.
Also check the logical-switch-port name that corresponds to the vm pod in your environment: is it the VM name, or does it also contain the pod name?
With keep-vm-ip enabled, the LSP name should contain only the VM name, not the VM pod name.
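Judging from the port names visible in the `k ko nbctl show` output later in this thread (`vm-ce7vpti8j5gkp4t1ino0.yiaas` on the default network, `vm-ce7vpti8j5gkp4t1ino0.yiaas.net1.yiaas.ovn` on the attached one), the expected LSP name under keep-vm-ip can be sketched as follows; `expected_lsp_name` is a hypothetical helper reconstructed from those examples, not kube-ovn's implementation:

```python
def expected_lsp_name(vm_name, namespace, provider=None):
    """With keep-vm-ip enabled the LSP should be named after the VM
    (not the launcher pod): <vm>.<namespace>, with the provider
    appended for a non-default network attachment."""
    base = f"{vm_name}.{namespace}"
    return base if provider is None else f"{base}.{provider}"

print(expected_lsp_name("vm-ce7vpti8j5gkp4t1ino0", "yiaas"))
print(expected_lsp_name("vm-ce7vpti8j5gkp4t1ino0", "yiaas", "net1.yiaas.ovn"))
```

If an LSP name were to also embed the launcher pod's unique suffix, the port (and its IP CRD) would churn on every stop/start, which is exactly what keep-vm-ip is meant to avoid.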

@zbb88888 (Collaborator) commented Dec 7, 2022

> For a VM with keep-vm-ip=true, if the VM pod is deleted while in the Running state, the IP CRD is deleted together with the vm pod. If the VM is deleted while in the Stopped state, the IP CRD has to wait for GC, roughly 12 minutes: a stopped VM may be started again, so its IP CRD is kept until the VM itself is deleted.

> Right now, when a VM is stopped and then started again immediately, it appears to lose its IP address; presumably the IP is reclaimed at shutdown and a new address, different from the original one, is allocated on restart.

> For this symptom, please confirm whether the keep-vm-ip option is enabled: only when it is disabled does stopping a VM delete the IP CRD directly. Also check the logical-switch-port name that corresponds to the vm pod in your environment: is it the VM name, or does it also contain the pod name? With keep-vm-ip enabled, the LSP name should contain only the VM name, not the VM pod name.

1. keep-vm-ip is enabled by default, and it is enabled on our side as well:




[root@iaas-cms-ctrl-1 ~]# k get deployment -n kube-system kube-ovn-controller -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  # ...
  name: kube-ovn-controller

spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: kube-ovn-controller
  strategy:
    rollingUpdate:
      maxSurge: 0%
      maxUnavailable: 100%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: kube-ovn-controller
        component: network
        type: infra
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: kube-ovn-controller
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - /kube-ovn/start-controller.sh
        - --default-cidr=10.16.0.0/16
        - --default-gateway=10.16.0.1
        - --default-gateway-check=true
        - --default-logical-gateway=false
        - --default-exclude-ips=
        - --node-switch-cidr=100.64.0.0/16
        - --service-cluster-ip-range=10.96.0.0/12
        - --network-type=geneve
        - --default-interface-name=
        - --default-exchange-link-name=false
        - --default-vlan-id=100
        - --ls-dnat-mod-dl-dst=true
        - --pod-nic-type=veth-pair
        - --enable-lb=true
        - --enable-np=true
        - --enable-eip-snat=true
        - --enable-external-vpc=true
        - --logtostderr=false
        - --alsologtostderr=true
        - --gc-interval=360
        - --inspect-interval=20
        - --log_file=/var/log/kube-ovn/kube-ovn-controller.log
        - --log_file_max_size=0
        - --enable-lb-svc=false
        - --keep-vm-ip=true  # this option is currently enabled by default
        - --pod-default-fip-type=
        - --v=4
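Note that `--gc-interval=360` is in seconds. Assuming the "about 12 minutes" figure quoted earlier corresponds to roughly two GC periods (an assumption on my part, not something confirmed in this thread), the arithmetic lines up:

```python
# --gc-interval=360 means a GC pass every 6 minutes; a stopped VM's IP
# CRD surviving "about 12 minutes" would match roughly two full passes.
# (Assumption: the 12-minute figure equals two gc-interval periods.)
gc_interval_s = 360
observed_latency_min = 2 * gc_interval_s / 60
print(observed_latency_min)
```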

2. The VM LSP does map correctly to the VM name:

[root@iaas-cms-ctrl-1 ~]# k get vm -A -o wide
NAMESPACE   NAME                      AGE     STATUS               READY
yiaas       vm-ce7am7q8j5gjlb83p5lg   30h     Stopped              False
yiaas       vm-ce7geua8j5gjlb83p5s0   23h     Stopped              False
yiaas       vm-ce7umpq8j5gkp4t1in6g   7h16m   ErrorUnschedulable   False
yiaas       vm-ce7uuia8j5gkp4t1in8g   6h59m   Running              True
yiaas       vm-ce7vb628j5gkp4t1indg   6h33m   Running              True
yiaas       vm-ce7vpti8j5gkp4t1ino0   6h1m    Running              True
yiaas       vm-ce82n2q8j5gkp4t1io2g   162m    Running              True
yiaas       vm-ce84mrq8j5gkp4t1ioj0   26m     Running              True
[root@iaas-cms-ctrl-1 ~]#
[root@iaas-cms-ctrl-1 ~]#
[root@iaas-cms-ctrl-1 ~]# k ko nbctl show | grep -C 2 vm-ce7vpti8j5gkp4t1ino0
        type: router
        router-port: vpc-ce7vii28j5gkp4t1inig-subnet-ce7viu28j5gkp4t1inj0
    port vm-ce7vpti8j5gkp4t1ino0.yiaas.net1.yiaas.ovn
        addresses: ["00:00:00:6D:C8:24 153.6.28.254"]
switch 916fc4dc-34e0-4ea6-9b67-cb5a072ecfe9 (subnet-ce7v8m28j5gkp4t1ina0)
--
        type: router
        router-port: ovn-cluster-ovn-default
    port vm-ce7vpti8j5gkp4t1ino0.yiaas
        addresses: ["00:00:00:93:99:B3 10.6.16.133"]
    port vm-ce7geua8j5gjlb83p5s0.yiaas
[root@iaas-cms-ctrl-1 ~]#



@hongzhen-ma (Collaborator)

This configuration looks fine, and the LSP names are fine too.
But the symptom described looks like what happens when keep-vm-ip is not enabled.
Please check the first few lines of the kube-ovn-cni pod log to confirm which commit the image was built from.
I will also find an environment and try swapping images.

@zbb88888 (Collaborator) commented Dec 7, 2022

> This configuration looks fine, and the LSP names are fine too. But the symptom described looks like what happens when keep-vm-ip is not enabled. Please check the first few lines of the kube-ovn-cni pod log to confirm which commit the image was built from. I will also find an environment and try swapping images.

[root@iaas-cms-ctrl-1 ~]# k get daemonset -A -o wide | grep kube-ovn-cni
kube-system      kube-ovn-cni      3         3         3       3            3           kubernetes.io/os=linux   2d1h   cni-server                                           kubeovn/kube-ovn:v1.11.0                                                                                                              app=kube-ovn-cni
[root@iaas-cms-ctrl-1 ~]#
[root@iaas-cms-ctrl-1 ~]#
[root@iaas-cms-ctrl-1 ~]#
[root@iaas-cms-ctrl-1 ~]# k get po -A -o wide | grep kube-ovn-cni
kube-system            kube-ovn-cni-7gsxf                                 1/1     Running            0                 2d      10.121.33.12    iaas-cms-ctrl-2   <none>           <none>
kube-system            kube-ovn-cni-nsk5n                                 1/1     Running            0                 2d      10.121.33.13    iaas-cms-ctrl-3   <none>           <none>
kube-system            kube-ovn-cni-xq4hn                                 1/1     Running            0                 2d      10.121.33.11    iaas-cms-ctrl-1   <none>           <none>
[root@iaas-cms-ctrl-1 ~]# k logs -f -n kube-system            kube-ovn-cni-7gsxf
setting sysctl variable "net.ipv4.neigh.default.gc_thresh1" to "1024"
net.ipv4.neigh.default.gc_thresh1 = 1024
setting sysctl variable "net.ipv4.neigh.default.gc_thresh2" to "2048"
net.ipv4.neigh.default.gc_thresh2 = 2048
setting sysctl variable "net.ipv4.neigh.default.gc_thresh3" to "4096"
net.ipv4.neigh.default.gc_thresh3 = 4096
setting sysctl variable "net.netfilter.nf_conntrack_tcp_be_liberal" to "1"
net.netfilter.nf_conntrack_tcp_be_liberal = 1
I1205 16:50:58.161898 3737885 cniserver.go:34]
-------------------------------------------------------------------------------
Kube-OVN:
  Version:       v1.11.0
  Build:         2022-12-03_06:43:38
  Commit:        git-86f75c8
  Go Version:    go1.19.3
  Arch:          amd64

@hongzhen-ma (Collaborator)

Verified on a 1.10.7 environment:
(screenshot)

Delete the VM:
(screenshot)

GC records found in the kube-ovn-controller log:
(screenshot)

kube-ovn image:
(screenshot)

I could not reproduce the problem described in this issue.

@hongzhen-ma (Collaborator)

Reproduced the problem by deleting a pod in the Running state; this still needs further confirmation.

@hongzhen-ma (Collaborator)

(screenshot)

Delete the IP CRD:
(screenshot)
