lxd: sync CPU features across LXD cluster for VMs with migration.stateful=true (from Incus) #15351

mihalicyn · 2025-04-08T11:32:21Z

Resubmission of #15114 after testing

JIRA ticket: https://warthogs.atlassian.net/browse/LXD-2078

Signed-off-by: Yueyuanmei Zhang <[email protected]> (cherry picked from commit 6799c405a32e132bf43cf475cfddb8ef0a352b81) Signed-off-by: Kadin Sayani <[email protected]> Signed-off-by: Gabriel Mougard <[email protected]> License: Apache-2.0

Signed-off-by: Yueyuanmei Zhang <[email protected]> (cherry picked from commit f240048ec931554ca6440c03a9d58436fc5b57ed) Signed-off-by: Kadin Sayani <[email protected]> Signed-off-by: Gabriel Mougard <[email protected]> License: Apache-2.0

Signed-off-by: Gabriel Mougard <[email protected]>

Signed-off-by: Yueyuanmei Zhang <[email protected]> (cherry picked from commit 7399e88f837cbaa0f8e4d31ad740512765efe30e) Signed-off-by: Kadin Sayani <[email protected]> Signed-off-by: Gabriel Mougard <[email protected]> License: Apache-2.0

Signed-off-by: Stéphane Graber <[email protected]> Sponsored-by: Luizalabs (https://luizalabs.com) (cherry picked from commit d0b9b9931bbd0d0991d6068c4ef3e2d18657ff1f) Signed-off-by: Kadin Sayani <[email protected]> Signed-off-by: Gabriel Mougard <[email protected]> License: Apache-2.0

Signed-off-by: Stéphane Graber <[email protected]> Sponsored-by: Luizalabs (https://luizalabs.com) (cherry picked from commit edd7a4cbf9437e1aeb9e444f1beff4540c88ba63) Signed-off-by: Kadin Sayani <[email protected]> Signed-off-by: Gabriel Mougard <[email protected]> License: Apache-2.0

Signed-off-by: Stéphane Graber <[email protected]> Sponsored-by: Luizalabs (https://luizalabs.com) (cherry picked from commit b2be9b9d88fb2b6cf67fb24211740d1fd9bbc7cc) Signed-off-by: Kadin Sayani <[email protected]> Signed-off-by: Gabriel Mougard <[email protected]> License: Apache-2.0

Signed-off-by: Stéphane Graber <[email protected]> Sponsored-by: Luizalabs (https://luizalabs.com) (cherry picked from commit 4d43430e13e6f5b0c0c665bde317f8243709c4d3) Signed-off-by: Kadin Sayani <[email protected]> Signed-off-by: Gabriel Mougard <[email protected]> License: Apache-2.0

The "kvm64" CPU definition doesn't exist on aarch64 and possibly on other platforms, so restrict the logic to x86_64 for the time being. Signed-off-by: Stéphane Graber <[email protected]> (cherry picked from commit ee24c49a275268f08f9dd060fba1256699fee65e) Signed-off-by: Kadin Sayani <[email protected]> Signed-off-by: Gabriel Mougard <[email protected]> License: Apache-2.0

tomponline · 2025-04-09T07:42:21Z

lxd/daemon.go

@@ -2435,6 +2435,55 @@ func (d *Daemon) heartbeatHandler(w http.ResponseWriter, r *http.Request, isLead

 		logger.Info("Partial heartbeat received", logger.Ctx{"local": localClusterAddress})
 	}
+
+	// Refresh cluster member resource info cache.


This function heartbeatHandler is already quite long and intricate so am not too keen on making it more so.
Please can we split this logic out into a separate function we can call from here instead.
This is something the TIOBE TICS scoring system is also looking for (less complex functions).

tomponline · 2025-04-09T07:45:26Z

lxd/daemon.go

+		}
+
+		go func(name string, address string) {
+			muRefresh.Lock()


I'm quite confused by this lock.
We seem to be starting a go routine per online cluster member (so presumably there is some intention for concurrency), and then we serialise the whole function it via a mutex.

Is the intention to launch a cluster CPU flags refresh task and not hold up the rest of the heartbeatHandler function?

If so, we could move this logic into its own function (as suggested above) and then we can run it via go newfunction() and then have the actual refresh loop done sequentially.

tomponline · 2025-04-09T07:52:12Z

lxd/daemon.go

+			}
+
+			// Connect to the server.
+			client, err := cluster.Connect(address, s.Endpoints.NetworkCert(), s.ServerCert(), nil, true)


Given that the leader initiates heartbeat requests to every cluster member concurrently, does this means that on the hour (after the cache has expired) this section triggers every cluster member to connect to every cluster member to retrieve their CPU flags at the same time?

This sounds like something that might cause regularly periodic load spikes.

Also this hits client.GetServerResources() which as well as the CPU info, pulls a lot more hardware resource info.

tomponline · 2025-04-09T08:16:11Z