
Frequent "offline" status causing subnet router re-election and connection disruptions #1501


Closed
vsychov opened this issue Jun 30, 2023 · 11 comments
Labels
bug Something isn't working well described ❤️ Issues that are well described and researched, making it easier for maintainers.
Milestone

Comments

@vsychov
Contributor

vsychov commented Jun 30, 2023

Hello,

I have noticed a recurring issue where I often see console messages from headscale indicating that a machine has gone "offline", even though the machine is actually online and has no issues with its internet connection. As I am using tailscale as a subnet router, this results in re-election of the "primary route" if such a machine was being used as the "primary route", leading to connection disruptions.

It appears that the problem lies in how a machine is marked "offline", which relies on the last_seen field in the database. A machine is considered offline once its last_seen timestamp is more than 60 seconds old (keepAliveInterval). Therefore, a delay of even one extra second can mark the machine offline, leading to a new subnet router being elected.

It looks like the last_seen field is updated in keepAliveTicker and a few other places. In my setup this happens every 40-60 seconds, which does not leave enough margin.

From what I can see, this problem could be solved by updating the last_seen field in the updateCheckerTicker (which by default occurs every 10 seconds - NodeUpdateCheckInterval), simply by adding:

machine.LastSeen = &now

right after:

machine.LastSuccessfulUpdate = &now

I hope this suggestion is helpful and look forward to any feedback.

Thank you

@vsychov vsychov added the bug Something isn't working label Jun 30, 2023
@gbraad gbraad mentioned this issue Jul 2, 2023
2 tasks
@kradalby
Collaborator

kradalby commented Jul 7, 2023

This might already be fixed, or we might have the basis for fixing it once #1492 lands. That change starts looking at the Online field and sends updates in a different way. It might not address this directly, but it should make it easier to fix.

@kradalby kradalby added the well described ❤️ Issues that are well described and researched, making it easier for maintainers. label Sep 24, 2023
Contributor

This issue is stale because it has been open for 90 days with no activity.

@github-actions github-actions bot added the stale label Dec 24, 2023
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

@andreyrd

This is still an active issue in the latest stable version.
Is this fixed in the latest alpha and is the latest alpha ready for use in a prod-like environment?

@kradalby kradalby reopened this Jan 19, 2024
@kradalby kradalby added this to the v0.23.0 milestone Jan 19, 2024
@kradalby
Collaborator

@andreyrd we follow common software release practices, and alpha software is not recommended for production use. We need help testing it, so we release it under an alpha/beta label to signal that you need to be cautious when using it.

I believe the issue has been solved, but we need people who encounter the problem to test it. If you have the opportunity, that would be great.

@github-actions github-actions bot removed the stale label Jan 20, 2024
@kradalby
Collaborator

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?

@eNdiD

eNdiD commented Feb 27, 2024

@kradalby with the latest 0.23.5-alpha5 there is some odd behavior. I constantly see my Android clients go offline while they continue to work fine with the tailnet, but Headscale seems to stop sending updates to them. To come back online they need to send an update themselves, for example by moving to a different network, or I have to manually restart the Tailscale connection on them.

I have seen the very same thing on my Raspberry Pi, but only once, and I'm not sure what the cause was. Other Linux clients stay online without issue.

Update: Going offline is not instant. The Android nodes stay online for some time, on the order of hours. More interestingly, "offline" nodes may have a fairly fresh last seen value, such as one minute ago.

Update2: I believe it can be reproduced by switching networks, for example with the following scenario:

  1. Activate Tailscale on Android while being on the home Wi-Fi. Node stays online
  2. Turn off Wi-Fi, forcing the phone to switch to the mobile connection. Node stays online
  3. Turn on Wi-Fi. Node goes offline, last seen value continues to update

@fortitudepub
Contributor

fortitudepub commented Mar 19, 2024

I also ran into this issue with the 0.23.5+ version. After some investigation, I think it may happen when the existing connection to the controller is reset (for example by switching router or Wi-Fi, which can change the external NAT address, or for other reasons) and a new connection is established quickly afterwards. In that case, the old connection's deferred cleanup in poll.go may run after the new connection has been added, because the online status is now a map indexed by node key.
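
The suspected ordering problem can be sketched in isolation. Everything below is hypothetical (the `onlineTracker` type, its methods, and the connection-ID guard); it is not headscale's poll.go, and the guard is just one possible mitigation, not necessarily the fix that was shipped.

```go
package main

import (
	"fmt"
	"sync"
)

// onlineTracker is a hypothetical stand-in for an online-status map
// keyed by node key. The value is the ID of the connection that owns the entry.
type onlineTracker struct {
	mu    sync.Mutex
	conns map[string]uint64
	next  uint64
}

func newOnlineTracker() *onlineTracker {
	return &onlineTracker{conns: make(map[string]uint64)}
}

// connect registers a new connection for nodeKey and returns its ID.
func (t *onlineTracker) connect(nodeKey string) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.next++
	t.conns[nodeKey] = t.next
	return t.next
}

// disconnectNaive deletes the entry unconditionally, so a late cleanup
// from an old connection also removes the new connection's entry.
func (t *onlineTracker) disconnectNaive(nodeKey string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.conns, nodeKey)
}

// disconnect removes the entry only if connID still owns it.
func (t *onlineTracker) disconnect(nodeKey string, connID uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.conns[nodeKey] == connID {
		delete(t.conns, nodeKey)
	}
}

// online reports whether any connection is registered for nodeKey.
func (t *onlineTracker) online(nodeKey string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.conns[nodeKey]
	return ok
}

func main() {
	naive := newOnlineTracker()
	naive.connect("node-a")             // old connection
	naive.connect("node-a")             // new connection after the network switch
	naive.disconnectNaive("node-a")     // old connection's cleanup runs late
	fmt.Println(naive.online("node-a")) // node wrongly appears offline

	guarded := newOnlineTracker()
	oldID := guarded.connect("node-a")
	guarded.connect("node-a")
	guarded.disconnect("node-a", oldID)   // stale ID, entry is kept
	fmt.Println(guarded.online("node-a")) // node stays online
}
```

The naive variant matches the reported symptom: the node keeps working and keeps updating its last seen value, yet it is shown as offline.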

@kradalby
Collaborator

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?

@vsychov
Contributor Author

vsychov commented Apr 18, 2024

Thanks @kradalby, I'll run tests today or tomorrow.

@kradalby
Collaborator

I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue. Let me know if not and we will reopen it.


5 participants