Skip to content

Introduce XdsClusterManager to cache Clusters #6261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 23, 2025

Conversation

jrhee17
Copy link
Contributor

@jrhee17 jrhee17 commented May 29, 2025

Motivation:

This changeset can be thought of as a refactoring for the next PR #6096.
Functionally, there should be no changes.

Cluster caching

Currently, a ClusterManager is created every time a new XdsEndpointGroup is created.

While this was easy to implement at the time, this poses a problem where each XdsEndpointGroup creates a backing EndpointGroup such as HealthCheckedEndpointGroup or DnsAddressEndpointGroup. Moreover, even having multiple clusters contained in different routes would also consume resources. (e.g. route1 with cluster1 and route2 with cluster1. Although cluster1 is the same, they are contained in a different RouteSnapshot and thus handled twice)

A cluster is expected to be uniquely identified by its name as specified in the documentation, and this is reflected in the upstream implementation as well. ref

I propose that for each XdsBootstrap, clusters and the resources attached to them are cached by name. The centralized location for managing these clusters is named XdsClusterManager after ClusterManager.

In the future, XdsClusterManager will take over the responsibility of the current ClusterManager and also manage resources (health checking, dns, etc..) associated with clusters. While caching resources isn't done in this PR,
the following POC PR can be checked for reference on how this will be done.

XdsClusterManager initialization

The changes in this PR also try to align closer to the upstream implementation.

There is an inherent dependency between adding clusters and creating a bootstrap. Bootstrap clusters should be created on initialization; yet bootstrap clusters may require connection to a control plane (which may require other bootstraps). In order to resolve this, a couple of refactoring efforts have taken place.

The initialization process of XdsClusterManager has been modified to closely follows that of upstream. This guarantees that bootstrapped clusters will initialize correctly irregardless of the declaration order and matches upstream behavior. (ref). Essentially, start up is done in two phases 1) non-eds cluster initialization 2) eds cluster initialization. This is done since it is possible that eds clusters require a bootstrap to connect to a control plane (relying on non-eds clusters).

Another implication of the change above is that a bootstrap is not created when bootstrap clusters are initialized. In order to resolve this circular dependency, the logic where remote subscription (to the control plane) takes place has been abstracted to a ControlPlaneClientManager. Now that subscription doesn't require a bootstrap and requires a ControlPlaneClientManager, a new SubscriptionContext has been introduced to allow for easier subscription. Since non-eds bootstrap clusters don't require connection to a control plane, a separate StaticSubscriptionContext is used.

Modifications:

  • Cluster caching related changes
    • Primer related logic has been removed and replaced with a throw-away watcher based approach. The reasoning is that primers are not reliable anymore given that Clusters are now cached.
    • ResourceNodes now support multiple snapshot watchers. This change is made so that XdsClusterManager can determine if a cluster no longer has to be cached depending on the watchers.
  • XdsClusterManager initialization related changes
    • ControlPlaneClientManager is introduced which manages ConfigSourceClients
    • SubscriptionContext is introduced to remove the circular dependency between ResourceNodes and bootstrap.
    • BootstrapClusters is now initialized in two phases
  • Misc) Creating a ClusterRoot or ListenerRoot can cause logic in the ClusterManager to be executed from a non-eventloop. Modified the logic so that user threads cannot modify ClusterManager or the internal bootstrap state.
  • Misc) ClusterXdsResource#upstreamTlsContext is now created once and cached

Result:

@jrhee17 jrhee17 added this to the 1.33.0 milestone May 29, 2025
@jrhee17 jrhee17 marked this pull request as ready for review May 29, 2025 10:09
@jrhee17
Copy link
Contributor Author

jrhee17 commented May 30, 2025

ready for review

@minwoox minwoox requested a review from Copilot June 19, 2025 05:35
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the xDS resource management by introducing an XdsClusterManager to cache and reuse Cluster resources, while also updating the subscription context and resource node implementations to align with upstream behavior.

  • Introduces a new SubscriptionContext and DefaultSubscriptionContext replacing direct usage of XdsBootstrapImpl.
  • Refactors resource node classes and associated watchers for Route, Listener, Cluster, and Endpoint resources.
  • Updates static resources and bootstrap cluster initialization to support centralized cluster caching.

Reviewed Changes

Copilot reviewed 20 out of 27 changed files in this pull request and generated no comments.

Show a summary per file
File Description
xds/StaticSubscriptionContext.java Introduces a static subscription context with unsupported operations throwing exceptions.
xds/StaticResourceUtils.java Replaces the older XdsBootstrapImpl parameter with SubscriptionContext in utility methods for static resource creation.
xds/RouteXdsResource.java Removes the primer field and updates equality/hashCode to reflect changes.
xds/RouteResourceNode.java Refactors resource node watcher creation and closure to update virtual host snapshots.
xds/ListenerRoot.java & ListenerResourceNode.java Adjusts listener initialization and subscription to use SubscriptionContext and executes updates on the event loop.
xds/EndpointXdsResource.java & EndpointResourceNode.java Updates endpoint resource implementations to remove primer support and use SubscriptionContext.
xds/DefaultSubscriptionContext.java, ControlPlaneClientManager.java Introduces default subscription context and client management for control plane interactions.
xds/ClusterXdsResource.java, ClusterSnapshot.java, ClusterRoot.java, ClusterResourceNode.java Refactors cluster-related resource nodes, caching upstream TLS context and updating snapshot handling.
xds/BootstrapClusters.java Updates bootstrap cluster initialization to register static clusters via the new cluster manager.
xds/AbstractRoot.java, AbstractResourceNode.java Adjusts abstract resource node implementations to support SubscriptionContext and watcher management.
Comments suppressed due to low confidence (2)

xds/src/main/java/com/linecorp/armeria/xds/ListenerRoot.java:42

  • Consider whether scheduling the close operation asynchronously on the event loop meets the desired resource release guarantees. If immediate cleanup is required, evaluate synchronizing the close logic or awaiting the execution completion.
            eventLoop().execute(() -> node.onChanged(listenerXdsResource));

xds/src/main/java/com/linecorp/armeria/xds/AbstractResourceNode.java:49

  • [nitpick] The checkArgument conditions for dynamic and static nodes improve safety, but consider adding comments or more descriptive error messages to document the invariants expected for configSource values in each node type.
        if (resourceNodeType == ResourceNodeType.DYNAMIC) {

Copy link
Contributor

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks much simpler now. 👍 👍 👍


void secondaryInitialize(SubscriptionContext context) {
for (Cluster cluster: bootstrap.getStaticResources().getClustersList()) {
if (!cluster.hasEdsClusterConfig()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initializeEdsClusters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to keep the naming secondary as it is a widely used term upstream. (ref)

Let me know if you feel strongly about this though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know about it. Let's use the term secondary, then.

final XdsType type = node.type();
final String resourceName = node.name();
final ConfigSourceClient client = clientMap.get(node.configSource());
if (client != null && client.removeSubscriber(type, resourceName, node)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to use assert client != null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since updates are event-loop based, I imagined that a call tounsubscribe can be queued after ControlPlaneClientManager#close is called.
As there is no way to immediately know the state of the manager, I preferred to add a conditional statement instead of an assertion.
Let me know if you still prefer that an assertion is done though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't think of the case. Let's keep it as is.

Copy link
Contributor

@ikhoon ikhoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice. Left only minor comments.

notifyOnError(type, error);
}

void notifyOnError(XdsType type, Status error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question) Should we make this method private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is intended to be called from subclasses. Let me know if I misunderstood your intention

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me go ahead and merge this PR. If you have additional comments, let me follow up in the next iteration.

@jrhee17 jrhee17 merged commit d190370 into line:main Jun 23, 2025
11 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants