Fix to prevent consecutive reboots in case of reboot delay. #23

hlts2 · 2025-04-09T10:53:49Z

WHAT

This PR contains the following changes:

Implemented the isLastRebootCmdTimeAfter function, which checks if the last reboot command time for a specified node is after the given threshold time.

WHY

There was a scenario in which a reboot was delayed for over 60 minutes. Because the node's LastTransitionTime did not change during that time, the node-agent's check loop kept matching the node on every iteration and sent a new reboot command each time.

To address this, the check now uses both the node's LastTransitionTime and LastRebootCmdTime (the actual time the reboot command was sent). Comparing both ensures that a delayed reboot does not cause reboot commands to be sent repeatedly, which prevents unnecessary reboots.

Signed-off-by: hlts2 <[email protected]>

jokestax · 2025-04-09T13:05:59Z

Hey @hlts2, if we are storing the LastRebootCmdTime, I think we can remove LastTransitionTime.

hlts2 · 2025-04-09T16:31:10Z

@jokestax

Hi, thank you for your comments. I believe both LastRebootCmdTime and LastTransitionTime are necessary.

For example, in the following case, handling it with just LastRebootCmdTime would not be sufficient:

Condition:

LastRebootCmdTime: 2 hours ago (from the current time)
Time window: 1 hour

K8s nodes can temporarily become NotReady due to network issues or other factors, so they may transition NotReady → Ready → NotReady. In that case, relying only on LastRebootCmdTime could cause a reboot command to be issued as soon as the node becomes NotReady again, because the last command is already older than the time window.

What really matters is how long the node has been in the NotReady state. To track this, LastTransitionTime is necessary.

Additionally, after a reboot command is issued, the reboot might be delayed for various reasons. To prevent the same reboot command from being triggered again, it is important to track when the reboot command was last sent. This is where LastRebootCmdTime comes into play, as it prevents the command from being reissued before a certain amount of time has passed.

In summary, we use LastTransitionTime to track how long a node has been in the NotReady state, and LastRebootCmdTime to prevent the same reboot command from being triggered multiple times, especially if the reboot is delayed. So I think this approach enhances the existing logic. What do you think? 🤔

jokestax · 2025-04-10T06:58:19Z

I got it. Below I have listed the cases to make it clearer.

Let window = 60 min, LTT = minutes since LastTransitionTime, LRCT = minutes since LastRebootCmdTime.

- LTT > 60, LRCT < 60: don't reboot
- LTT < 60, LRCT < 60: don't reboot
- LTT < 60, LRCT > 60: don't reboot
- LTT > 60, LRCT > 60: reboot

pkg/watcher/watcher.go


hlts2 · 2025-04-10T07:28:17Z

@jokestax

> I got it. Below I have listed the cases to make it clearer.

OK, thank you for your comment. 🙏
I have fixed it in the following commit.

e34999a

hlts2 · 2025-04-10T07:28:42Z

@jokestax Thank you for your review 🙇 I will merge this PR 🚀


hlts2 added 6 commits April 9, 2025 19:42

- 8963374 fix: add isLastRebootTimeAfter method
- 9324efe feat: add NOTE comment
- 4e405a7 feat: add comment and log
- cd71a4d fix: bugfix condition check
- 1fd59a7 fix: log comment
- 354a2fc fix: variable name

hlts2 self-assigned this Apr 9, 2025

hlts2 added 2 commits April 9, 2025 20:58

- 66e20dc feat: add test case
- 7a81d45 feat: add test case for isLastRebootCommandTimeAfter method

hlts2 marked this pull request as ready for review April 9, 2025 12:24

hlts2 requested review from jokestax, johndietz and jarededwards April 9, 2025 12:27

jokestax reviewed Apr 10, 2025

pkg/watcher/watcher.go (resolved)

jokestax approved these changes Apr 10, 2025


- e34999a fix: add comment based on feedback
- 44c44cd fix: fix comment

jokestax approved these changes Apr 10, 2025


hlts2 merged commit 6b8426a into main Apr 10, 2025
1 check passed

hlts2 deleted the fix/recoard-reboot-time branch April 10, 2025 07:38
