-
Notifications
You must be signed in to change notification settings - Fork 4.5k
stats/openetelemetry: refactor and make e2e test stats verification deterministic #8077
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dc1badd
to
67b38cd
Compare
Sorry, lots of work on new team so no bandwidth to review this. Glad you worked on this though! |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8077 +/- ##
==========================================
+ Coverage 82.15% 82.29% +0.13%
==========================================
Files 387 387
Lines 39067 38963 -104
==========================================
- Hits 32094 32063 -31
+ Misses 5643 5581 -62
+ Partials 1330 1319 -11
|
2038431
to
dc1e2ba
Compare
171c066
to
82ba13e
Compare
82ba13e
to
3e8fc54
Compare
7d96428
to
f83d6aa
Compare
// Verify spans has error status with rpcErrorMsg as error message. | ||
for ; len(exporter.GetSpans()) == 0 && ctx.Err() == nil; <-time.After(time.Millisecond) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would change the test behaviour though. Earlier the test was asserting that exactly 3 spans are received and the first span's status message is as expected. Now it's just asserting the second part. Is it okay to skip checking the span count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -46,7 +46,7 @@ func populateSpan(rs stats.RPCStats, ai *attemptInfo) { | |||
// correctness. | |||
span.SetAttributes( | |||
attribute.Bool("Client", rs.Client), | |||
attribute.Bool("FailFast", rs.Client), | |||
attribute.Bool("FailFast", rs.FailFast), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a bugfix? If yes, should we have a release note and test, possibly in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we never released the opentelemetry tracing api anyways. This will be the first time. However, it looks like the release notes were not picked up for it https://github.com/grpc/grpc-go/releases
1d2b508
to
5f22f2d
Compare
cds: stop child policies on resource-not-found errors (grpc#8122) xds: simplify code handling certain error conditions in the resolver (grpc#8123) xds, pickfirst: Enable additional addresses in xDS, set new pick_first as default (grpc#8126) github: change test action to cover the legacy pickfirst balancer (grpc#8129) cleanup: replace dial with newclient (grpc#7967) cleanup: replace dial with newclient (grpc#7970) stats/openetelemetry: refactor and make e2e test stats verification deterministic (grpc#8077) xds: introduce simple grpc transport for generic xds clients (grpc#8066) xds: generic xds client common configs re-push comments improve ServerConfig equal easwar review round 1 on documentation easwar comments on docstrings easwar comments round 4 config tests merge with previous pr xds: add lrs client and xDS client interfaces second pass to documentation language change from godoc review dfawley review 2 easwar review 1 changed to decoder struct move authorities under xds client easwar review 2 easwars review 4 ResourceWatcher done and LoadStore stop grpc based transport remove server config extension interface add byte codec dfawley review 1 send and recv tests with byte based test server change to proto based server easwar review 1 easwar review 3 xds: generic xds client ads transport channel
ads: stop child policies on resource-not-found errors (grpc#8122) xds: simplify code handling certain error conditions in the resolver (grpc#8123) xds, pickfirst: Enable additional addresses in xDS, set new pick_first as default (grpc#8126) github: change test action to cover the legacy pickfirst balancer (grpc#8129) cleanup: replace dial with newclient (grpc#7967) cleanup: replace dial with newclient (grpc#7970) stats/openetelemetry: refactor and make e2e test stats verification deterministic (grpc#8077) xds: introduce simple grpc transport for generic xds clients (grpc#8066) xds: generic xds client common configs re-push comments improve ServerConfig equal easwar review round 1 on documentation easwar comments on docstrings easwar comments round 4 config tests merge with previous pr xds: add lrs client and xDS client interfaces second pass to documentation language change from godoc review dfawley review 2 easwar review 1 changed to decoder struct move authorities under xds client easwar review 2 easwars review 4 ResourceWatcher done and LoadStore stop grpc based transport remove server config extension interface add byte codec dfawley review 1 send and recv tests with byte based test server change to proto based server easwar review 1 easwar review 3 xds: generic xds client ads transport channel
Fixes: #8053 #7423 #8076
The flakiness in stats/opentelemetry e2e tests flakiness is related to asynchronous metrics and trace collection. gRPC server metrics, specifically those generated by
stats.End
calls, are processed asynchronously. This means that immediately after a client-side RPC completes, the corresponding server-side metrics may not be available yet. To handle this asynchronicity, the tests employ a polling mechanism that repeatedly checks for the expected metrics until they appear or a timeout occurs. This polling strategy ensures that the tests wait for the server to finish processing the metrics, preventing premature validation against incomplete data.However, the following limitations are still there which are causing the flakiness
sent_total_compressed_message_size
, did not adequately account for asynchronous data point population. The tests would sometimes check the histogram before all expected data points were present, resulting in false negatives.duration
metrics did not use the above polling mechanism, specifically for histogram metrics.This PR fixes these flakiness issues by adding the appropriate polling before the metric and trace validation logic.
waitForServerCompletedRPCs
now not only wait for required metric to be present but also wait for the number of histogram data points to match the expected values before proceeding with validation.duration
metrics validation check now alsowaitForServerCompletedRPCs
.RELEASE NOTES: None