-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Graceful stop database heartbeats in ICS #55700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
16de2df
to
d969dbb
Compare
d969dbb
to
8248c04
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to stop heartbeating individual databases one by one at shutdown time?
lib/inventory/controller.go
Outdated
return trace.AccessDenied("incorrect database server ID (expected %q, got %q)", handle.Hello().ServerID, key.hostID) | ||
} | ||
|
||
if _, ok := handle.databaseServers[key]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be an error to stop a heartbeat that wasn't running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client handle may have been lost and re-established, but the heartbeat may not have been re-announced prior to client sending a "Stop" message, so there's no keepalives running for it but the heartbeat still exists in backend.
We could just log an error message here and return early, letting the heartbeat expire on its own, but why not just delete it instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that the heartbeat will get stopped while another copy of the heartbeat has just started heartbeating? I'm not seeing "ownership" of a particular name on the HeartbeatV2 side or the db.Server side. Could this also happen if two competing instances of Teleport are running, say, during an upgrade?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that the heartbeat will get stopped while another copy of the heartbeat has just started heartbeating? I'm not seeing "ownership" of a particular name on the HeartbeatV2 side or the db.Server side.
It's not possible in a single process, because the stop(name)
message is only sent when a dynamic db
is unregistered.
If a db
is updated, we don't send stop(name)
.
If it's unregistered and then immediately re-registered, the message order would still be stop(name) -> start(name)
because gRPC client streams guarantee message ordering of scheduled messages, and we block until the message is scheduled, and our reconciliation loop will handle deletion and recreation in a single goroutine.
This is all non-obvious and somewhat fragile, so I'll just log a debug message (I see no reason to return an error and close the ICS handle) and let the heartbeat expire instead of deleting it
I doubt this edge case will even happen in practice anyway.
Could this also happen if two competing instances of Teleport are running, say, during an upgrade?
I don't think that this edge case is relevant in the code we're discussing, because I think child and parent processes will each have their own control stream.
However, I think you could have a scenario where a forked process is running concurrent with its parent and both are running dynamic db reconciliation loops.
In that case they could both observe events OpDelete(name) ... OpPut(name)
.
Both would try to send stop(name) -> start(name)
messages.
But it's possible that one of the processes exits before sending start(name)
, so you could get a message order like this:
child: stop(name)
child: start(name)
parent: stop(name)
// parent exits
If this edge case is actually possible, I think we would have to use a conditional delete to handle it, which I doubt we want to do.
It's also a self-healing edge case, so we can just let it happen: the db would disappear temporarily until the child re-announces it after a few minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it to just log a message and skip deleting the heartbeat: d682d06
No I intentionally omitted that from the db service shutdown code since we already have the graceful "goodbye" to delete all its heartbeats with one message. Graceful individual heartbeat stop is used for one case: a dynamic database is unregistered. This is important when buffering rate-limited heartbeats in a follow up PR, to avoid a rate-limit delayed heartbeat from being kept alive forever after a database is unregistered. |
8248c04
to
4ff53d0
Compare
lib/srv/db/server.go
Outdated
// the heartbeat may have been created using a fallback method or the | ||
// upstream doesn't support graceful stop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we get rid of the fallback announce at this point, so we'd only need to deal with upstreams not supporting in-band delete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah we can, db heartbeatv2 support was released in 17.0.0 so I was being conservative when I added DELETE IN 19
for the fallback announcer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually we can rip out the announcer from all heartbeat v2 types it seems.
To keep this PR focused and small, I'll just stop passing a fallback announcer in NewDatabaseServerHeartbeat
.
Then in another PR I'll rip out the announcer entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib/inventory/controller.go
Outdated
return trace.AccessDenied("incorrect database server ID (expected %q, got %q)", handle.Hello().ServerID, key.hostID) | ||
} | ||
|
||
if _, ok := handle.databaseServers[key]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that the heartbeat will get stopped while another copy of the heartbeat has just started heartbeating? I'm not seeing "ownership" of a particular name on the HeartbeatV2 side or the db.Server side. Could this also happen if two competing instances of Teleport are running, say, during an upgrade?
4ff53d0
to
9fc96d4
Compare
Support stopping an individual database heartbeat in the inventory control stream instead of waiting for multiple keepalives to fail.
9fc96d4
to
8645241
Compare
705ecdd
to
d682d06
Compare
Part of: #50237
Support stopping an individual database heartbeat in the inventory control stream instead of waiting for multiple keepalives to fail.