Skip to content

Fix to prevent consecutive reboots in case of reboot delay. #23

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 10, 2025

Conversation

hlts2
Copy link
Member

@hlts2 hlts2 commented Apr 9, 2025

WHAT

This PR contains the following changes:

  • Implemented the isLastRebootCmdTimeAfter function, which checks if the last reboot command time for a specified node is after the given threshold time.

WHY

In this case, there was a scenario where the reboot was delayed for over 60 minutes. As a result, the node's LastTransitionTime remained unchanged, causing reboot commands to be issued every time. In other words, since the node's LastTransitionTime didn't change for 60 minutes, the node-agent's check loop continued to trigger, resulting in multiple reboot commands being sent during that time.

To address this, I considered it necessary to use both the node's LastTransitionTime and LastRebootCmdTime (the actual time the reboot command was sent) to handle such cases. This dual comparison ensures that reboot commands are not repeatedly triggered in the event of delays, effectively preventing unnecessary reboots.

@hlts2 hlts2 self-assigned this Apr 9, 2025
@hlts2 hlts2 marked this pull request as ready for review April 9, 2025 12:24
@jokestax
Copy link
Contributor

jokestax commented Apr 9, 2025

Hey @hlts2 ,if we are storing the LastRebootCmdTime,i think we can remove LastTransitionTime

@hlts2
Copy link
Member Author

hlts2 commented Apr 9, 2025

@jokestax

Hi, thank you for your comments. I believe both LastRebootCmdTime and LastTransitionTime are necessary.

For example, in the following case, handling it with just LastRebootCmdTime would not be sufficient:

Condition:

  • LastRebootCmdTime: 2 hours ago (from the current time)
  • Time window: 1 hour

K8s nodes can temporarily become NotReady due to network issues or other factors, meaning they may transition between NotReadyReadyNotReady. In such cases, relying only on LastRebootCmdTime might cause a reboot command to be issued immediately.

What really matters is how long the node has been in the NotReady state. To track this, LastTransitionTime is necessary.

Additionally, after a reboot command is issued, the reboot might be delayed for various reasons. To prevent the same reboot command from being triggered again, it is important to track when the reboot command was last sent. This is where LastRebootCmdTime comes into play, as it prevents the command from being reissued before a certain amount of time has passed.

In summary, we use LastTransitionTime to track how long a node has been in the NotReady state, and LastRebootCmdTime to prevent the same reboot command from being triggered multiple times, especially if the reboot is delayed. So I think this approach enhances the existing logic. What do you think? 🤔

@jokestax
Copy link
Contributor

i got it,below i have listed the cases to make it more clear

 let window = 60 min
 
- LTT > 60 , LRCT < 60 dont reboot
- LTT < 60 , LRCT < 60 dont reboot
- LTT < 60 , LRCT > 60 dont reboot
- LTT > 60, LRCT >. 60 reboot

@hlts2
Copy link
Member Author

hlts2 commented Apr 10, 2025

@jokestax

i got it,below i have listed the cases to make it more clear

OK, Thank you for your comment. 🙏
I have fixed it with the following commit.

e34999a

@hlts2
Copy link
Member Author

hlts2 commented Apr 10, 2025

@jokestax Thank you for your review 🙇 I will merge this PR 🚀

Signed-off-by: hlts2 <[email protected]>
@hlts2 hlts2 merged commit 6b8426a into main Apr 10, 2025
1 check passed
@hlts2 hlts2 deleted the fix/recoard-reboot-time branch April 10, 2025 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants