Skip to content

Graceful stop database heartbeats in ICS #55700

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

GavinFrazar
Copy link
Contributor

@GavinFrazar GavinFrazar commented Jun 13, 2025

Part of: #50237

Support stopping an individual database heartbeat in the inventory control stream instead of waiting for multiple keepalives to fail.

@GavinFrazar GavinFrazar added the no-changelog Indicates that a PR does not require a changelog entry label Jun 13, 2025
@github-actions github-actions bot added database-access Database access related issues and PRs size/md labels Jun 13, 2025
@GavinFrazar GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from 16de2df to d969dbb Compare June 13, 2025 01:05
@GavinFrazar GavinFrazar requested review from espadolini and rosstimothy and removed request for EdwardDowling June 13, 2025 01:29
@GavinFrazar GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from d969dbb to 8248c04 Compare June 13, 2025 01:41
Copy link
Contributor

@espadolini espadolini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to stop heartbeating individual databases one by one at shutdown time?

return trace.AccessDenied("incorrect database server ID (expected %q, got %q)", handle.Hello().ServerID, key.hostID)
}

if _, ok := handle.databaseServers[key]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be an error to stop a heartbeat that wasn't running?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client handle may have been lost and re-established, but the heartbeat may not have been re-announced prior to client sending a "Stop" message, so there's no keepalives running for it but the heartbeat still exists in backend.

We could just log an error message here and return early, letting the heartbeat expire on its own, but why not just delete it instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that the heartbeat will get stopped while another copy of the heartbeat has just started heartbeating? I'm not seeing "ownership" of a particular name on the HeartbeatV2 side or the db.Server side. Could this also happen if two competing instances of Teleport are running, say, during an upgrade?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that the heartbeat will get stopped while another copy of the heartbeat has just started heartbeating? I'm not seeing "ownership" of a particular name on the HeartbeatV2 side or the db.Server side.

It's not possible in a single process, because the stop(name) message is only sent when a dynamic db is unregistered.
If a db is updated, we don't send stop(name).
If it's unregistered and then immediately re-registered, the message order would still be stop(name) -> start(name) because gRPC client streams guarantee message ordering of scheduled messages, and we block until the message is scheduled, and our reconciliation loop will handle deletion and recreation in a single goroutine.

This is all non-obvious and somewhat fragile, so I'll just log a debug message (I see no reason to return an error and close the ICS handle) and let the heartbeat expire instead of deleting it

I doubt this edge case will even happen in practice anyway.

Could this also happen if two competing instances of Teleport are running, say, during an upgrade?

I don't think that this edge case is relevant in the code we're discussing, because I think child and parent processes will each have their own control stream.

However, I think you could have a scenario where a forked process is running concurrent with its parent and both are running dynamic db reconciliation loops.
In that case they could both observe events OpDelete(name) ... OpPut(name).
Both would try to send stop(name) -> start(name) messages.
But it's possible that one of the processes exits before sending start(name), so you could get a message order like this:

child: stop(name)
child: start(name)
parent: stop(name)
// parent exits

If this edge case is actually possible, I think we would have to use a conditional delete to handle it, which I doubt we want to do.

It's also a self-healing edge case, so we can just let it happen: the db would disappear temporarily until the child re-announces it after a few minutes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it to just log a message and skip deleting the heartbeat: d682d06

@GavinFrazar
Copy link
Contributor Author

GavinFrazar commented Jun 18, 2025

Is this going to stop heartbeating individual databases one by one at shutdown time?

No I intentionally omitted that from the db service shutdown code since we already have the graceful "goodbye" to delete all its heartbeats with one message.

Graceful individual heartbeat stop is used for one case: a dynamic database is unregistered. This is important when buffering rate-limited heartbeats in a follow up PR, to avoid a rate-limit delayed heartbeat from being kept alive forever after a database is unregistered.

@GavinFrazar GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from 8248c04 to 4ff53d0 Compare June 21, 2025 00:56
Comment on lines 834 to 820
// the heartbeat may have been created using a fallback method or the
// upstream doesn't support graceful stop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we get rid of the fallback announce at this point, so we'd only need to deal with upstreams not supporting in-band delete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we can, db heartbeatv2 support was released in 17.0.0 so I was being conservative when I added DELETE IN 19 for the fallback announcer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we can rip out the announcer from all heartbeat v2 types it seems.

To keep this PR focused and small, I'll just stop passing a fallback announcer in NewDatabaseServerHeartbeat.
Then in another PR I'll rip out the announcer entirely.

Copy link
Contributor Author

@GavinFrazar GavinFrazar Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

edit: and the comment too: a2acaee

return trace.AccessDenied("incorrect database server ID (expected %q, got %q)", handle.Hello().ServerID, key.hostID)
}

if _, ok := handle.databaseServers[key]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that the heartbeat will get stopped while another copy of the heartbeat has just started heartbeating? I'm not seeing "ownership" of a particular name on the HeartbeatV2 side or the db.Server side. Could this also happen if two competing instances of Teleport are running, say, during an upgrade?

@GavinFrazar GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from 4ff53d0 to 9fc96d4 Compare June 23, 2025 21:44
Support stopping an individual database heartbeat in the inventory
control stream instead of waiting for multiple keepalives to fail.
@GavinFrazar GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from 9fc96d4 to 8645241 Compare June 23, 2025 22:22
@GavinFrazar GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from 705ecdd to d682d06 Compare June 23, 2025 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
database-access Database access related issues and PRs no-changelog Indicates that a PR does not require a changelog entry size/md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants