[BUG] Deleting FIP overwrites EIP ready state #5114

Open
cruickshankpg opened this issue Apr 1, 2025 · 1 comment
Labels
bug Something isn't working eip

Comments

cruickshankpg commented Apr 1, 2025

Kube-OVN Version

v1.13.4

Kubernetes Version

v1.28.6

Operating System/Kernel Version

"Ubuntu 22.04.5 LTS" 6.8.0-47-generic

Description

Deleting a FIP triggers a reset of the associated EIP. Resetting an EIP sets status.ready: true even if the EIP is not yet ready. The EIP update handler then does not program the EIP on the NAT GW.

We hit this after a NAT GW was rescheduled to a new node. Rescheduling marks all FIPs and EIPs as not ready so that they can be programmed on the new pod. A FIP was deleted and recreated while this was in progress, and the associated EIP was never added to the new gateway pod.
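
The sequence described above can be modelled in a few lines of Go. This is a minimal sketch of the reported behaviour, not Kube-OVN's controller code: the `eip` struct, `resetEip`, `handleUpdateEip`, and `simulateRace` are hypothetical names introduced only for illustration, and the handler is assumed to skip any EIP whose status already reads ready.

```go
package main

import "fmt"

// eip is a hypothetical, stripped-down stand-in for an EIP resource.
type eip struct {
	Name       string
	Nat        string // NAT rule types using the EIP, e.g. "fip"
	Ready      bool   // mirrors status.ready
	Programmed bool   // whether the address is actually configured on the gateway pod
}

// resetEip models the reset triggered by a FIP deletion as described in
// this issue: it clears the NAT usage but also sets Ready unconditionally.
func resetEip(e *eip) {
	e.Nat = ""
	e.Ready = true
}

// handleUpdateEip models an update handler that assumes a ready EIP is
// already programmed and therefore does nothing for it.
func handleUpdateEip(e *eip) {
	if e.Ready {
		return
	}
	e.Programmed = true
	e.Ready = true
}

// simulateRace replays the order of events from the report.
func simulateRace() eip {
	e := eip{Name: "eip-1", Nat: "fip", Ready: true, Programmed: true}

	// The gateway pod is rescheduled: the EIP is marked not ready and the
	// new pod does not have the address yet.
	e.Ready = false
	e.Programmed = false

	// The FIP is deleted before the EIP has been reprogrammed.
	resetEip(&e)

	// The update handler runs afterwards and skips the EIP.
	handleUpdateEip(&e)
	return e
}

func main() {
	e := simulateRace()
	fmt.Printf("ready=%v programmed=%v\n", e.Ready, e.Programmed)
	// prints: ready=true programmed=false
}
```

In this model the EIP ends up reporting ready while the address is missing from the gateway pod, which matches what we observed on net1.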

Steps To Reproduce

It's a race condition so it can be difficult to consistently recreate this bug.

  1. Create vpc nat gw
  2. Create a lot of EIPs and FIPs on vpc nat gw
  3. Delete vpc nat gw pod
  4. Delete all FIPs
  5. Exec onto vpc nat gw pod and check ip addresses on net1

Current Behavior

Deleting a FIP while its EIP is not yet programmed prevents the EIP from ever being programmed.

Expected Behavior

Deleting a FIP while its EIP is not yet programmed has no impact on the EIP being programmed.
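
One way to picture the expected behaviour is a reset that clears the NAT usage but leaves the ready flag alone. The sketch below is only an illustration under that assumption: `eipStatus` and `resetPreservingReady` are hypothetical names, and this is not a proposed patch against the Kube-OVN source.

```go
package main

import "fmt"

// eipStatus is a hypothetical, simplified view of an EIP's status.
type eipStatus struct {
	Nat   string // NAT rule types using the EIP, e.g. "fip"
	Ready bool   // mirrors status.ready
}

// resetPreservingReady clears the NAT usage recorded on the EIP but does
// not touch Ready, so a pending EIP stays pending and can still be
// picked up for programming later.
func resetPreservingReady(s eipStatus) eipStatus {
	s.Nat = ""
	return s
}

func main() {
	pending := eipStatus{Nat: "fip", Ready: false}
	after := resetPreservingReady(pending)
	fmt.Printf("nat=%q ready=%v\n", after.Nat, after.Ready)
	// prints: nat="" ready=false
}
```

With a reset like this, an EIP that was not ready before the FIP deletion is still not ready afterwards, so nothing hides it from the step that programs it on the gateway.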

@cruickshankpg cruickshankpg added the bug Something isn't working label Apr 1, 2025
@dosubot dosubot bot added the eip label Apr 1, 2025

dosubot bot commented Apr 1, 2025

Deleting a FIP in Kube-OVN v1.13.4 affects the ready state of an associated EIP because the EIP's status is updated based on the NAT rules that use it. When a FIP is deleted, the NAT rules associated with that FIP are also removed. This change is detected by the isOvnEipInUse function, which checks for the presence of NAT rules (DNAT, FIP, SNAT) using the EIP. If no NAT rules are found, the EIP's status is updated to reflect that it is no longer in use, which can prevent it from being programmed on the NAT Gateway [1][2].

The issue you're encountering seems to be related to a race condition where the EIP's status is incorrectly set to ready: true even when it is not yet ready, preventing it from being programmed on the NAT Gateway. This could be due to the timing of the FIP deletion and the NAT Gateway rescheduling. The system should ideally handle such race conditions more gracefully to ensure that EIPs are correctly programmed after such events.
