-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Remove failed nodes from serfWAN #6028
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! Just a drive-by, blind review here. Definitely ignore my comments if they aren't relevant or helpful.
Also, maybe a test for this new behavior would be good!
Cheers! Thanks for inviting me / letting me participate :)
@@ -22,7 +22,8 @@ type Delegate interface { | |||
NotifyHealth(OperatorHealthReply) | |||
PromoteNonVoters(*Config, OperatorHealthReply) ([]raft.Server, error) | |||
Raft() *raft.Raft | |||
Serf() *serf.Serf | |||
SerfLAN() *serf.Serf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who implements this interface? I can't find much in the codebase with grep, which makes me think it's the customer or external user who implements this?
I think this looks like an API-breaking change, since the type is exported we're changing the interface definition. Is that acceptable? How do you guys usually handle breaking changes to public interfaces?
Apologies if this comment seems irrelevant! Still gaining project context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. This interface is implemented by AutopilotDelegate, which is used in the agent/consul
pkg to initialize a new NewAutopilot.
External users shouldn't be implementing this, they should only be interacting with consul via the api pkg.
agent/consul/autopilot/autopilot.go
Outdated
//is failed - then check to see if it is a serfLAN | ||
//member, remove it. | ||
func (a *Autopilot) pruneSerfWAN(serfWAN *serf.Serf, serfLAN *serf.Serf) { | ||
for _, memberWAN := range serfWAN.Members() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have very limited context about what this code is doing, so totally ignore this if it seems inappropriate or not important!
This looks like it could potentially be an expensive operation with the 2 nested for loops. Maybe this isn't super important, but maybe we can make the algorithm more efficient? Perhaps by caching something from the first or second for loop, and then just having 2 side-by-side loops? Maybe first iterating through either the serfLAN or serfWAN members and caching our findings in a way that is easy to look up (like in a map)? (making the algorithm go from O(n^2) => O(2n)).
Let me know if I’m not making sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey -
It was super expensive (I whined about this function and it's complexity, as well as it's poor readability imo) - I wouldn't want to cache anything for Serf due to the possibility of harming our results.
What this is s'posed to be doing is retrieving all the Serf members from WAN and LAN, comparing the results, and removing any failed members. The reason we did it like this was because we wanted to find any members in LAN after they have been removed, and if we find one then we need to remove it from WAN. We were trying to not do the calculation for Raft twice (where we make sure we can still maintain a consensus).
Luckily the new implementation gets rid of this function entirely. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, almost there!
@@ -22,7 +22,8 @@ type Delegate interface { | |||
NotifyHealth(OperatorHealthReply) | |||
PromoteNonVoters(*Config, OperatorHealthReply) ([]raft.Server, error) | |||
Raft() *raft.Raft | |||
Serf() *serf.Serf | |||
SerfLAN() *serf.Serf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. This interface is implemented by AutopilotDelegate, which is used in the agent/consul
pkg to initialize a new NewAutopilot.
External users shouldn't be implementing this, they should only be interacting with consul via the api pkg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job @s-christoff 🎉
Co-Authored-By: Paul Banks <[email protected]>
I happened to look at it again, and it looks great! 👍 |
Autopilot primarily focuses on serfLAN and places failed servers into a left state in that gossip pool, but does not do this for serfWAN.
Whenever serfWAN is in use, we should be removing any failed node from the gossip pool to create consistency between serfWAN and serfLAN.
In the
force-leave
command it is important to check if when passing nodenames to serfWAN it is formatted in the proper way (nodename.datacenter) or it will not properly be removed.Fixes #3361