Skip to content

stats/openetelemetry: refactor and make e2e test stats verification deterministic #8077

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Feb 28, 2025

Conversation

purnesh42H
Copy link
Contributor

@purnesh42H purnesh42H commented Feb 11, 2025

Fixes: #8053 #7423 #8076

The flakiness in stats/opentelemetry e2e tests flakiness is related to asynchronous metrics and trace collection. gRPC server metrics, specifically those generated by stats.End calls, are processed asynchronously. This means that immediately after a client-side RPC completes, the corresponding server-side metrics may not be available yet. To handle this asynchronicity, the tests employ a polling mechanism that repeatedly checks for the expected metrics until they appear or a timeout occurs. This polling strategy ensures that the tests wait for the server to finish processing the metrics, preventing premature validation against incomplete data.

However, the following limitations are still there which are causing the flakiness

  1. the validation logic for histogram metrics, particularly sent_total_compressed_message_size, did not adequately account for asynchronous data point population. The tests would sometimes check the histogram before all expected data points were present, resulting in false negatives.
  2. the validation logic for duration metrics did not use the above polling mechanism, specifically for histogram metrics.
  3. the trace exporter was only queried once for collected spans, which could lead to missing spans due to the asynchronous nature of trace exporting. This resulted in incomplete trace verification and occasional test failures.

This PR fixes these flakiness issues by adding the appropriate polling before the metric and trace validation logic.

  1. the waitForServerCompletedRPCs now not only wait for required metric to be present but also wait for the number of histogram data points to match the expected values before proceeding with validation.
  2. The duration metrics validation check now also waitForServerCompletedRPCs.
  3. the trace exporter is polled repeatedly until all expected spans are available, ensuring complete trace collection.

RELEASE NOTES: None

@zasweq
Copy link
Contributor

zasweq commented Feb 11, 2025

Sorry, lots of work on new team so no bandwidth to review this. Glad you worked on this though!

@zasweq zasweq removed their request for review February 11, 2025 19:51
@zasweq zasweq removed their assignment Feb 11, 2025
Copy link

codecov bot commented Feb 11, 2025

Codecov Report

Attention: Patch coverage is 59.45946% with 15 lines in your changes missing coverage. Please review.

Project coverage is 82.29%. Comparing base (e95a4b7) to head (5f22f2d).
Report is 30 commits behind head on master.

Files with missing lines Patch % Lines
...tats/opentelemetry/internal/testutils/testutils.go 58.33% 10 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8077      +/-   ##
==========================================
+ Coverage   82.15%   82.29%   +0.13%     
==========================================
  Files         387      387              
  Lines       39067    38963     -104     
==========================================
- Hits        32094    32063      -31     
+ Misses       5643     5581      -62     
+ Partials     1330     1319      -11     
Files with missing lines Coverage Δ
stats/opentelemetry/trace.go 83.33% <100.00%> (ø)
...tats/opentelemetry/internal/testutils/testutils.go 94.69% <58.33%> (-0.88%) ⬇️

... and 48 files with indirect coverage changes

@purnesh42H purnesh42H force-pushed the stats-otel-flaky-tests branch 2 times, most recently from 2038431 to dc1e2ba Compare February 12, 2025 17:11
@purnesh42H purnesh42H force-pushed the stats-otel-flaky-tests branch 4 times, most recently from 171c066 to 82ba13e Compare February 14, 2025 02:41
@arjan-bal arjan-bal removed their assignment Feb 18, 2025
@purnesh42H purnesh42H requested a review from arjan-bal February 18, 2025 13:44
@purnesh42H purnesh42H requested a review from arjan-bal February 24, 2025 19:42
@purnesh42H purnesh42H assigned arjan-bal and unassigned easwars Feb 24, 2025
@purnesh42H purnesh42H force-pushed the stats-otel-flaky-tests branch from 7d96428 to f83d6aa Compare February 25, 2025 04:43
@arjan-bal arjan-bal removed their assignment Feb 25, 2025
@purnesh42H purnesh42H requested a review from arjan-bal February 25, 2025 12:12
@purnesh42H purnesh42H assigned arjan-bal and unassigned purnesh42H Feb 25, 2025
// Verify spans has error status with rpcErrorMsg as error message.
for ; len(exporter.GetSpans()) == 0 && ctx.Err() == nil; <-time.After(time.Millisecond) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would change the test behaviour though. Earlier the test was asserting that exactly 3 spans are received and the first span's status message is as expected. Now it's just asserting the second part. Is it okay to skip checking the span count?

@arjan-bal arjan-bal assigned purnesh42H and unassigned arjan-bal Feb 26, 2025
@purnesh42H purnesh42H requested a review from arjan-bal February 27, 2025 04:51
@purnesh42H purnesh42H assigned arjan-bal and unassigned purnesh42H Feb 27, 2025
Copy link
Contributor

@arjan-bal arjan-bal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arjan-bal arjan-bal removed their assignment Feb 27, 2025
@@ -46,7 +46,7 @@ func populateSpan(rs stats.RPCStats, ai *attemptInfo) {
// correctness.
span.SetAttributes(
attribute.Bool("Client", rs.Client),
attribute.Bool("FailFast", rs.Client),
attribute.Bool("FailFast", rs.FailFast),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bugfix? If yes, should we have a release note and test, possibly in a separate PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we never released the opentelemetry tracing api anyways. This will be the first time. However, it looks like the release notes were not picked up for it https://github.com/grpc/grpc-go/releases

@dfawley dfawley assigned purnesh42H and unassigned dfawley Feb 27, 2025
@purnesh42H purnesh42H force-pushed the stats-otel-flaky-tests branch from 1d2b508 to 5f22f2d Compare February 28, 2025 08:38
@purnesh42H purnesh42H merged commit 01080d5 into grpc:master Feb 28, 2025
15 checks passed
purnesh42H pushed a commit to purnesh42H/grpc-go that referenced this pull request Mar 1, 2025
cds: stop child policies on resource-not-found errors (grpc#8122)

xds: simplify code handling certain error conditions in the resolver (grpc#8123)

xds, pickfirst: Enable additional addresses in xDS, set new pick_first as default (grpc#8126)

github: change test action to cover the legacy pickfirst balancer (grpc#8129)

cleanup: replace dial with newclient (grpc#7967)

cleanup: replace dial with newclient (grpc#7970)

stats/openetelemetry: refactor and make e2e test stats verification deterministic (grpc#8077)

xds: introduce simple grpc transport for generic xds clients (grpc#8066)

xds: generic xds client common configs

re-push comments

improve ServerConfig equal

easwar review round 1 on documentation

easwar comments on docstrings

easwar comments round 4

config tests

merge with previous pr

xds: add lrs client and xDS client interfaces

second pass to documentation language

change from godoc review

dfawley review 2

easwar review 1

changed to decoder struct

move authorities under xds client

easwar review 2

easwars review 4

ResourceWatcher done and LoadStore stop

grpc based transport

remove server config extension interface

add byte codec

dfawley review 1

send and recv tests with byte based test server

change to proto based server

easwar review 1

easwar review 3

xds: generic xds client ads transport channel
purnesh42H pushed a commit to purnesh42H/grpc-go that referenced this pull request Mar 1, 2025
ads: stop child policies on resource-not-found errors (grpc#8122)

xds: simplify code handling certain error conditions in the resolver (grpc#8123)

xds, pickfirst: Enable additional addresses in xDS, set new pick_first as default (grpc#8126)

github: change test action to cover the legacy pickfirst balancer (grpc#8129)

cleanup: replace dial with newclient (grpc#7967)

cleanup: replace dial with newclient (grpc#7970)

stats/openetelemetry: refactor and make e2e test stats verification deterministic (grpc#8077)

xds: introduce simple grpc transport for generic xds clients (grpc#8066)

xds: generic xds client common configs

re-push comments

improve ServerConfig equal

easwar review round 1 on documentation

easwar comments on docstrings

easwar comments round 4

config tests

merge with previous pr

xds: add lrs client and xDS client interfaces

second pass to documentation language

change from godoc review

dfawley review 2

easwar review 1

changed to decoder struct

move authorities under xds client

easwar review 2

easwars review 4

ResourceWatcher done and LoadStore stop

grpc based transport

remove server config extension interface

add byte codec

dfawley review 1

send and recv tests with byte based test server

change to proto based server

easwar review 1

easwar review 3

xds: generic xds client ads transport channel
janardhanvissa pushed a commit to janardhanvissa/grpc-go that referenced this pull request Mar 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test Test/MethodAttributeFilter Flaky Test: MetricsAndTracesOptionEnabled Flaky Test: AllMetricsOneFunction
5 participants