Skip to content

xdsclient: update watcher API as per gRFC A88 #7977

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Apr 14, 2025

Conversation

purnesh42H
Copy link
Contributor

@purnesh42H purnesh42H commented Jan 2, 2025

This is the first part of implementing gRFC A88 (grpc/proposal#466).

This introduces the new watcher API. This table summarizes the API changes and behavior for each case:

Case Old API New API Behavior
Resource timer fires OnResourceDoesNotExist() OnResourceChanged(NOT_FOUND) Fail data plane RPCs
LDS or CDS resource deletion OnResourceDoesNotExist() OnResourceChanged(NOT_FOUND) Drop resource and fail data plane RPCs
xDS channel reports TRANSIENT_FAILURE OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
ADS stream terminates without receiving a response OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
Invalid resource update (client NACK) OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
Valid resource update OnUpdate(resource) OnResourceChanged(resource) use the new resource

RELEASE NOTES: None

google3 import run analysis did not show any issues

@purnesh42H purnesh42H added Type: Feature New features or improvements in behavior Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Jan 2, 2025
@purnesh42H purnesh42H added this to the 1.70 Release milestone Jan 2, 2025
@purnesh42H purnesh42H self-assigned this Jan 2, 2025
@purnesh42H purnesh42H force-pushed the a88-watcher-api branch 7 times, most recently from 8d198b7 to a9b45c0 Compare January 3, 2025 16:28
@purnesh42H purnesh42H requested a review from markdroth January 3, 2025 16:58
@purnesh42H purnesh42H assigned dfawley and markdroth and unassigned purnesh42H Jan 3, 2025
@purnesh42H purnesh42H requested a review from dfawley January 3, 2025 16:59
@purnesh42H purnesh42H force-pushed the a88-watcher-api branch 2 times, most recently from 4009e3e to 57dbf23 Compare January 6, 2025 12:03
Copy link

codecov bot commented Jan 6, 2025

Codecov Report

Attention: Patch coverage is 80.20305% with 39 lines in your changes missing coverage. Please review.

Project coverage is 82.21%. Comparing base (ce35fd4) to head (c43cd62).
Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
xds/internal/server/listener_wrapper.go 68.88% 12 Missing and 2 partials ⚠️
xds/internal/resolver/xds_resolver.go 47.82% 10 Missing and 2 partials ⚠️
xds/internal/server/rds_handler.go 61.11% 7 Missing ⚠️
.../balancer/clusterresolver/resource_resolver_eds.go 50.00% 3 Missing and 1 partial ⚠️
xds/internal/testutils/resource_watcher.go 83.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7977      +/-   ##
==========================================
+ Coverage   82.16%   82.21%   +0.05%     
==========================================
  Files         410      412       +2     
  Lines       40248    40487     +239     
==========================================
+ Hits        33068    33287     +219     
- Misses       5830     5839       +9     
- Partials     1350     1361      +11     
Files with missing lines Coverage Δ
internal/xds/bootstrap/bootstrap.go 65.72% <ø> (ø)
xds/internal/balancer/cdsbalancer/cdsbalancer.go 84.49% <100.00%> (-0.15%) ⬇️
...s/internal/balancer/cdsbalancer/cluster_watcher.go 100.00% <100.00%> (ø)
...rnal/balancer/clusterresolver/resource_resolver.go 94.44% <100.00%> (ø)
xds/internal/resolver/serviceconfig.go 85.71% <100.00%> (ø)
xds/internal/resolver/watch_service.go 100.00% <100.00%> (+8.57%) ⬆️
xds/internal/xdsclient/authority.go 80.38% <100.00%> (+0.60%) ⬆️
xds/internal/xdsclient/clientimpl_watchers.go 84.81% <100.00%> (+0.39%) ⬆️
...nal/xdsclient/xdsresource/cluster_resource_type.go 77.14% <100.00%> (ø)
...l/xdsclient/xdsresource/endpoints_resource_type.go 75.00% <100.00%> (ø)
... and 8 more

... and 41 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@purnesh42H purnesh42H force-pushed the a88-watcher-api branch 4 times, most recently from 43c9adb to 89f475a Compare January 6, 2025 12:26
@easwars
Copy link
Contributor

easwars commented Apr 7, 2025

the expected behavior, according to A88 is if resource is NACKed and is not cached, the watcher should drop the resource

If we are talking about the case where a NACK is seen without the resource being cached, then the question of not using the resource anymore does not come into the picture, because the resource does not exist yet.

@easwars
Copy link
Contributor

easwars commented Apr 7, 2025

Yes, I agree that we should not have checks for "resource-not-found" errors in the individual watchers. In fact, I'm thinking we might not even need those error types like xdsresource.ErrorTypeResourceNotFound anymore. If ResourceError is invoked, then the watcher must stop using the previous configuration and if AmbientError is invoked, it should continue using the previous configuration and the xDS client should guarantee that AmbientError is only invoked when a previous copy of the resource was delivered via ResourceChanged.

While making another pass through this, I really think we should get rid of these error types, or at least look into getting rid of them. The reason behind my proposal is that when a watcher gets a ResourceError, it should simply stop using that resource if it already had it. There should be no logic of checking the error type and doing different things. Having these error types encourages that behavior.

We don't have to make that change as part of this PR, but I think you should add it to your list of things that you have that need to be done as part of the generic xDS client changes.

@purnesh42H
Copy link
Contributor Author

Yes, I agree that we should not have checks for "resource-not-found" errors in the individual watchers. In fact, I'm thinking we might not even need those error types like xdsresource.ErrorTypeResourceNotFound anymore. If ResourceError is invoked, then the watcher must stop using the previous configuration and if AmbientError is invoked, it should continue using the previous configuration and the xDS client should guarantee that AmbientError is only invoked when a previous copy of the resource was delivered via ResourceChanged.

While making another pass through this, I really think we should get rid of these error types, or at least look into getting rid of them. The reason behind my proposal is that when a watcher gets a ResourceError, it should simply stop using that resource if it already had it. There should be no logic of checking the error type and doing different things. Having these error types encourages that behavior.

We don't have to make that change as part of this PR, but I think you should add it to your list of things that you have that need to be done as part of the generic xDS client changes.

yes, I have added. Thanks.

@purnesh42H
Copy link
Contributor Author

purnesh42H commented Apr 7, 2025

the expected behavior, according to A88 is if resource is NACKed and is not cached, the watcher should drop the resource

If we are talking about the case where a NACK is seen without the resource being cached, then the question of not using the resource anymore does not come into the picture, because the resource does not exist yet.

yeah if watcher has not seen the valid resource before then its not applicable. The gRFC talks about case when watcher has seen the valid resource but it deleted in xDS client cache https://github.com/grpc/proposal/blob/master/A88-xds-data-error-handling.md#changes-to-xdsclient-watcher-apis.

@easwars easwars assigned purnesh42H and unassigned easwars Apr 7, 2025
@purnesh42H purnesh42H requested a review from easwars April 8, 2025 19:45
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Apr 8, 2025
Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. Very minor comments this time around. But some action items are there. So, please let's make sure we have them tracked. Thanks.

@@ -474,20 +478,28 @@ func (b *cdsBalancer) onClusterUpdate(name string, update xdsresource.ClusterUpd
// If the security config is invalid, for example, if the provider
// instance is not found in the bootstrap config, we need to put the
// channel in transient failure.
b.onClusterError(name, b.annotateErrorWithNodeID(fmt.Errorf("received Cluster resource contains invalid security config: %v", err)))
b.onClusterResourceError(name, b.annotateErrorWithNodeID(fmt.Errorf("received Cluster resource contains invalid security config: %v", err)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it that you unconditionally call onClusterResourceError here while you conditionally call onClusterResourceError or onClusterAmbientError in the two cases below?

I would think of these errors as a data errors too. This is a class of errors which cannot be handled at the xDS client layer (because it doesn't have the required information to do so) but has to be handled at the respective watcher. There are some cases like this for the listener resource as well.

Could you please track this one as well, along with the other items that you are tracking. We should see if we can have a uniform story for this class of errors across languages.

Copy link
Contributor Author

@purnesh42H purnesh42H Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the same check here as well. It doesn't harm? Though please note that we have finished updating the new lastUpdate with new update. So, would it make sense to its call ambient error here because we have the update stored?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though please note that we have finished updating the new lastUpdate with new update

Yeah, we could look into not updating that field in the cases where we rejecting the update in the cds LB policy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the same code of checking if the child policy exists or not and calling one of the onClusterXxxError methods is repeated three times. We could pull it into a helper method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -482,7 +482,7 @@ func (r *xdsResolver) onError(err error) {
// are removed.
//
// Only executed in the context of a serializer callback.
func (r *xdsResolver) onResourceNotFound() {
func (r *xdsResolver) onResourceError(err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github doesn't allow me to comment on unchanged lines. But in method applyRouteConfigUpdate, there is a call to onError. I feel that this is also a data error and also belongs to the class of error I was talking about in the comment in xds/internal/balancer/cdsbalancer/cdsbalancer.go.

Copy link
Contributor Author

@purnesh42H purnesh42H Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's kind of weird. I have added the TODO. Its again being called from resource update method which has finished applying the new update. I think we should call resource error here. is it because we have finished applying the new update so we have a cached resource? For now, I have kept it as ambient error to not break existing behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please link a github issue with full details in any TODO that is being added, unless you plan to act on the TODO right away in which case you can add your github handle to the todo. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added alias. It will be fast follow from okr tracker

@easwars easwars assigned purnesh42H and unassigned easwars Apr 10, 2025
@purnesh42H purnesh42H requested a review from easwars April 10, 2025 17:32
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Apr 10, 2025
@easwars easwars assigned purnesh42H and unassigned easwars Apr 11, 2025
@purnesh42H purnesh42H merged commit 68205d5 into grpc:master Apr 14, 2025
15 checks passed
janardhanvissa pushed a commit to janardhanvissa/grpc-go that referenced this pull request Apr 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants