Graceful stop database heartbeats in ICS #55700

GavinFrazar · 2025-06-13T00:42:55Z

Part of: #50237

There is no way to gracefully stop a single inventory control stream heartbeat #50237

Support stopping an individual database heartbeat in the inventory control stream instead of waiting for multiple keepalives to fail.

espadolini

Is this going to stop heartbeating individual databases one by one at shutdown time?

api/proto/teleport/legacy/client/proto/inventory.proto

espadolini · 2025-06-18T16:05:31Z

lib/inventory/controller.go

+		return trace.AccessDenied("incorrect database server ID (expected %q, got %q)", handle.Hello().ServerID, key.hostID)
+	}
+
+	if _, ok := handle.databaseServers[key]; ok {


Shouldn't it be an error to stop a heartbeat that wasn't running?

The client handle may have been lost and re-established without the heartbeat being re-announced before the client sent a "Stop" message. In that case there are no keepalives running for it, but the heartbeat still exists in the backend.

We could just log an error message here and return early, letting the heartbeat expire on its own, but why not just delete it instead?

Is it possible that the heartbeat will get stopped while another copy of the heartbeat has just started heartbeating? I'm not seeing "ownership" of a particular name on the HeartbeatV2 side or the db.Server side. Could this also happen if two competing instances of Teleport are running, say, during an upgrade?

> Is it possible that the heartbeat will get stopped while another copy of the heartbeat has just started heartbeating? I'm not seeing "ownership" of a particular name on the HeartbeatV2 side or the db.Server side.

It's not possible in a single process, because the stop(name) message is only sent when a dynamic db is unregistered.
If a db is updated, we don't send stop(name).
If it's unregistered and then immediately re-registered, the message order is still stop(name) -> start(name), for three reasons: gRPC client streams preserve the order of scheduled messages, we block until each message is scheduled, and our reconciliation loop handles deletion and recreation in a single goroutine.

This is all non-obvious and somewhat fragile, so I'll just log a debug message and let the heartbeat expire instead of deleting it. I see no reason to return an error and close the ICS handle.

I doubt this edge case will even happen in practice anyway.
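To illustrate the ordering argument above, here is a minimal sketch. It assumes a hypothetical `reconcile` function that processes events in a single goroutine and an unbuffered channel standing in for the control stream. The `msg` and `event` types are invented for the example and are not Teleport's.

```go
package main

import "fmt"

// msg is a hypothetical stand-in for an upstream control stream message.
type msg struct {
	kind string // "stop" or "start"
	name string
}

// event is a hypothetical reconciler event for a dynamic database.
type event struct {
	op   string // "delete" or "put"
	name string
}

// reconcile handles events sequentially in one goroutine. Each send blocks
// until the stream accepts the message, so a delete followed by a put for
// the same name is always emitted as stop followed by start.
func reconcile(events []event, stream chan<- msg) {
	for _, ev := range events {
		switch ev.op {
		case "delete":
			stream <- msg{kind: "stop", name: ev.name}
		case "put":
			stream <- msg{kind: "start", name: ev.name}
		}
	}
	close(stream)
}

// collect runs the reconciler and drains the stream in arrival order.
func collect(events []event) []msg {
	stream := make(chan msg) // unbuffered: the sender blocks per message
	go reconcile(events, stream)
	var got []msg
	for m := range stream {
		got = append(got, m)
	}
	return got
}

func main() {
	got := collect([]event{{"delete", "db1"}, {"put", "db1"}})
	for _, m := range got {
		fmt.Printf("%s(%s)\n", m.kind, m.name)
	}
}
```

The guarantee only holds because there is one sender. Two goroutines sending for the same name could interleave in either order.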

> Could this also happen if two competing instances of Teleport are running, say, during an upgrade?

I don't think that this edge case is relevant in the code we're discussing, because I think child and parent processes will each have their own control stream.

However, I think you could have a scenario where a forked process is running concurrently with its parent and both are running dynamic db reconciliation loops.
In that case they could both observe events OpDelete(name) ... OpPut(name).
Both would try to send stop(name) -> start(name) messages.
But it's possible that one of the processes exits before sending start(name), so you could get a message order like this:

    child: stop(name)
    child: start(name)
    parent: stop(name) // parent exits

If this edge case is actually possible, I think we would have to use a conditional delete to handle it, which I doubt we want to do.

It's also a self-healing edge case, so we can just let it happen: the db would disappear temporarily until the child re-announces it after a few minutes.
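For reference, the conditional delete mentioned above could look roughly like the following. This is a sketch of the alternative the thread decides against, not what the PR does. The `store` type, its revision counter, and `deleteIfRevision` are hypothetical names for an in-memory stand-in, not a real backend API.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical in-memory stand-in for a heartbeat backend.
type store struct {
	mu   sync.Mutex
	next int
	revs map[string]int // heartbeat name -> revision of the latest write
}

func newStore() *store { return &store{revs: make(map[string]int)} }

// put upserts a heartbeat and returns the revision assigned to this write.
func (s *store) put(name string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.next++
	s.revs[name] = s.next
	return s.next
}

// deleteIfRevision removes the heartbeat only if rev is still its latest
// revision. It reports whether the delete happened.
func (s *store) deleteIfRevision(name string, rev int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.revs[name]; !ok || cur != rev {
		return false
	}
	delete(s.revs, name)
	return true
}

// exists reports whether a heartbeat with the given name is stored.
func (s *store) exists(name string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.revs[name]
	return ok
}

func main() {
	s := newStore()
	parentRev := s.put("db1") // the parent announced the heartbeat first
	s.put("db1")              // the child re-announces the same name

	// The parent's late stop carries its stale revision and is rejected,
	// so the child's heartbeat survives.
	fmt.Println(s.deleteIfRevision("db1", parentRev)) // false
	fmt.Println(s.exists("db1"))                      // true
}
```

The cost is that every stop message has to carry a revision or owner token, which is the extra complexity the comment above would rather avoid for a self-healing case.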

changed it to just log a message and skip deleting the heartbeat: d682d06
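The resulting behavior can be sketched as below, under several assumptions. The `handle` and `dbKey` types are simplified stand-ins, `fmt.Errorf` replaces `trace.AccessDenied`, and the standard `log` package replaces the real logger. Only the host ID check and the map lookup mirror the diff shown above. What the real handler does once it finds a tracked heartbeat is not visible in this thread, so the sketch only stops tracking it.

```go
package main

import (
	"fmt"
	"log"
)

// dbKey is a hypothetical key identifying one database heartbeat.
type dbKey struct{ hostID, name string }

// handle is a simplified stand-in for an upstream control stream handle.
type handle struct {
	serverID        string
	databaseServers map[dbKey]struct{}
}

// handleStop processes a stop message for one database heartbeat. It returns
// an error only when the key belongs to a different host. A stop for an
// untracked heartbeat is logged and ignored, leaving it to expire on its own.
func (h *handle) handleStop(key dbKey) (stopped bool, err error) {
	if key.hostID != h.serverID {
		return false, fmt.Errorf("incorrect database server ID (expected %q, got %q)", h.serverID, key.hostID)
	}
	if _, ok := h.databaseServers[key]; !ok {
		log.Printf("debug: ignoring stop for untracked database heartbeat %q", key.name)
		return false, nil
	}
	// Assumption: stopping means no longer tracking the heartbeat here.
	delete(h.databaseServers, key)
	return true, nil
}

func main() {
	h := &handle{
		serverID: "host-a",
		databaseServers: map[dbKey]struct{}{
			{hostID: "host-a", name: "db1"}: {},
		},
	}
	fmt.Println(h.handleStop(dbKey{hostID: "host-a", name: "db1"})) // true <nil>
	fmt.Println(h.handleStop(dbKey{hostID: "host-a", name: "db1"})) // false <nil>
}
```

Returning nil for the untracked case keeps the control stream open, which matches the reasoning that an unexpected stop is not worth closing the ICS handle over.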

GavinFrazar · 2025-06-18T20:32:46Z

> Is this going to stop heartbeating individual databases one by one at shutdown time?

No, I intentionally omitted that from the db service shutdown code, since we already have the graceful "goodbye" to delete all its heartbeats with one message.

Graceful individual heartbeat stop is used for one case: a dynamic database is unregistered. This matters for buffering rate-limited heartbeats in a follow-up PR, where it prevents a heartbeat that was delayed by the rate limit from being kept alive forever after its database is unregistered.
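As a rough illustration of that split, the sketch below uses a hypothetical `dbService` that records the messages it would send upstream. The type, its methods, and the message strings are made up for the example.

```go
package main

import "fmt"

// dbService is a hypothetical database service that records the control
// stream messages it would send upstream.
type dbService struct {
	registered map[string]bool
	sent       []string
}

func newDBService(names ...string) *dbService {
	s := &dbService{registered: make(map[string]bool)}
	for _, n := range names {
		s.registered[n] = true
	}
	return s
}

// unregister removes one dynamic database and sends an individual stop for
// its heartbeat. Unknown names are ignored.
func (s *dbService) unregister(name string) {
	if !s.registered[name] {
		return
	}
	delete(s.registered, name)
	s.sent = append(s.sent, "stop("+name+")")
}

// shutdown sends a single goodbye covering every remaining heartbeat instead
// of one stop per database.
func (s *dbService) shutdown() {
	s.registered = make(map[string]bool)
	s.sent = append(s.sent, "goodbye")
}

func main() {
	s := newDBService("db1", "db2", "db3")
	s.unregister("db2")
	s.shutdown()
	fmt.Println(s.sent) // [stop(db2) goodbye]
}
```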

espadolini · 2025-06-23T08:07:29Z

lib/srv/db/server.go

+	// the heartbeat may have been created using a fallback method or the
+	// upstream doesn't support graceful stop.


Can't we get rid of the fallback announce at this point, so we'd only need to deal with upstreams not supporting in-band delete?

yeah we can, db heartbeatv2 support was released in 17.0.0 so I was being conservative when I added DELETE IN 19 for the fallback announcer.

actually we can rip out the announcer from all heartbeat v2 types it seems.

To keep this PR focused and small, I'll just stop passing a fallback announcer in NewDatabaseServerHeartbeat.
Then in another PR I'll rip out the announcer entirely.

done

edit: and the comment too: a2acaee

api/proto/teleport/legacy/client/proto/inventory.proto

espadolini · 2025-06-23T08:22:37Z

lib/inventory/controller.go



GavinFrazar added the no-changelog label Jun 13, 2025

github-actions bot added the database-access and size/md labels Jun 13, 2025

github-actions bot requested review from EdwardDowling and greedy52 June 13, 2025 00:43

GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from 16de2df to d969dbb June 13, 2025 01:05

GavinFrazar requested review from espadolini and rosstimothy and removed request for EdwardDowling June 13, 2025 01:29

GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from d969dbb to 8248c04 June 13, 2025 01:41

espadolini reviewed Jun 18, 2025

GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from 8248c04 to 4ff53d0 June 21, 2025 00:56

espadolini reviewed Jun 23, 2025

GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from 4ff53d0 to 9fc96d4 June 23, 2025 21:44

Graceful stop database heartbeats in ICS

8645241

Support stopping an individual database heartbeat in the inventory control stream instead of waiting for multiple keepalives to fail.

GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from 9fc96d4 to 8645241 June 23, 2025 22:22

Log unexpected stop db message

d682d06

GavinFrazar force-pushed the gavinfrazar/stop-ics-hb branch from 705ecdd to d682d06 June 23, 2025 22:36

update comment to remove mention of fallback upserts

a2acaee

GavinFrazar mentioned this pull request Jun 24, 2025

Rate limit DB server heartbeats #56022

Draft
