Skip to content

xdsclient: make sending requests more deterministic #7774

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 25, 2024

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Oct 24, 2024

This PR fixes a flaky test https://github.com/grpc/grpc-go/actions/runs/11499180098/job/32006581739?pr=7773, where the following happens:

  • CDS LB policy requests for a cluster resource which turns out to be an aggregate cluster.
  • CDS LB policy then requests for the child clusters, which turn out to leaf clusters of type EDS.
  • CDS LB policy generates child policy config for cluster_resolver with discovery mechanisms for the above EDS clusters.
  • cluster_resolver does not handle its config update inline. Instead it pushes the config update on to a channel which is then read by a goroutine.
  • When the config update is eventually picked up the goroutine, the cluster_resolver LB policy ends up requesting for EDS resources corresponding to the discovery mechanisms in its config.

But the asynchronous behavior of cluster_resolver means that CDS LB policy considers the local processing of the response containing the child clusters to be complete, and therefore invokes the done callback on the xds client to unblock the receive side flow control on the ADS stream. This requires a fix in the cluster_resolver LB policy.

But the above failure also uncovered an issue with the new ADS stream implementation. Consider the following sequence of events:

  • A subscription request for resource A, reaches the ADS stream implementation. It pushes this onto a channel and returns.
  • A subscription request for resource B, reaches the ADS stream implementation. It pushes this onto a channel and returns.
  • A runner goroutine reads the first message off the channel, but it sees a subscription for both resources A and B, and it ends up sending them both.
  • The same runner goroutine reads the second message off the channel, and again sees a subscription for both resources A and B, and it ends up sending them both.
  • While this is not a correctness issue, it causes flakes in tests.

What this PR does:

  • It adds a pending_write bit to the resource-type specific state maintained by the ADS stream implementation
  • The bit it set to true every time a resource is being subscribed or unsubscribed
  • And whenever resources of a particular type are sent out, the bit is set back to false
  • This guarantees that in the above described scenario, only one request message is sent out for both resources

#a71-xds-fallback
#xdsclient-refactor
Addresses #6902

RELEASE NOTES: none

@easwars easwars requested a review from zasweq October 24, 2024 20:32
@easwars easwars added this to the 1.68 Release milestone Oct 24, 2024
@easwars easwars requested a review from purnesh42H October 24, 2024 20:32
Copy link

codecov bot commented Oct 24, 2024

Codecov Report

Attention: Patch coverage is 64.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 81.34%. Comparing base (f8e5d8f) to head (c0c1456).
Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
xds/internal/xdsclient/transport/ads/ads_stream.go 64.00% 6 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7774      +/-   ##
==========================================
+ Coverage   81.32%   81.34%   +0.01%     
==========================================
  Files         368      368              
  Lines       36779    36782       +3     
==========================================
+ Hits        29912    29919       +7     
+ Misses       5635     5626       -9     
- Partials     1232     1237       +5     
Files with missing lines Coverage Δ
xds/internal/xdsclient/transport/ads/ads_stream.go 69.49% <64.00%> (+1.07%) ⬆️

... and 16 files with indirect coverage changes

@zasweq
Copy link
Contributor

zasweq commented Oct 25, 2024

Glad this fixes ADS Stream flake, and glad we got a temporary fix in for the CDS processing stuff. All these tests/handling of resources will be moved the resolver eventually, but looks like it'll be next year since I don't see it on our OKR's anymore.

Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

// watch timers are started for the resources in the request.
//
// Caller needs to hold c.mu.
func (s *StreamImpl) sendMessageIfWritePending(stream transport.StreamingCall, typ xdsresource.Type, state *resourceTypeState) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sendMessageIfWritePendingLocked

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines -389 to -391
if s.logger.V(2) {
s.logger.Infof("Re-requesting resources %v of type %q, as the stream has been recreated", names, typ.TypeURL())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be put on line 381 now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to get rid of this log message since the resources are anyway logged from sendMessageLocked.

@zasweq zasweq assigned easwars and unassigned zasweq Oct 25, 2024
@easwars easwars merged commit 67b9ebf into grpc:master Oct 25, 2024
16 checks passed
@easwars easwars deleted the ads_stream_pending_write_bit branch October 25, 2024 22:24
misvivek pushed a commit to misvivek/grpc-go that referenced this pull request Nov 7, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 24, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants