This repository was archived by the owner on Jul 11, 2023. It is now read-only.

Fix flaky e2e tests #4844

Merged
merged 13 commits into openservicemesh:main from fix-flaky-test
Jun 28, 2022

Conversation

keithmattix
Contributor

@keithmattix keithmattix commented Jun 23, 2022

Signed-off-by: Keith Mattix II [email protected]

Description:
Fixes flaky e2e test behavior observed in CI
Fixes #4843

Testing done:
Local e2e testing and multiple CI runs to validate that flakiness has ceased

Affected area:

Functional Area
Tests [X]

Please answer the following questions with yes/no.

  1. Does this change contain code from or inspired by another project?

    • Did you notify the maintainers and provide attribution?
  2. Is this a breaking change?

  3. Has documentation corresponding to this change been updated in the osm-docs repo (if applicable)?

@codecov-commenter

codecov-commenter commented Jun 23, 2022

Codecov Report

Merging #4844 (16865c8) into main (f3966a3) will decrease coverage by 0.18%.
The diff coverage is 16.66%.

@@            Coverage Diff             @@
##             main    #4844      +/-   ##
==========================================
- Coverage   69.52%   69.33%   -0.19%     
==========================================
  Files         219      219              
  Lines       16002    16012      +10     
==========================================
- Hits        11125    11102      -23     
- Misses       4823     4856      +33     
  Partials       54       54              
Flag Coverage Δ
unittests 69.33% <16.66%> (-0.19%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
cmd/osm-bootstrap/osm-bootstrap.go 47.87% <0.00%> (ø)
cmd/osm-controller/osm-controller.go 14.46% <0.00%> (ø)
pkg/certificate/providers/mrc.go 0.00% <0.00%> (-35.49%) ⬇️
pkg/certificate/types.go 100.00% <ø> (ø)
pkg/certificate/manager.go 76.70% <25.00%> (-1.72%) ⬇️
pkg/certificate/providers/config.go 68.29% <0.00%> (-8.54%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3966a3...16865c8. Read the comment docs.

Signed-off-by: Keith Mattix II <[email protected]>
@keithmattix keithmattix marked this pull request as ready for review June 28, 2022 00:19
@nojnhuh
Contributor

nojnhuh commented Jun 28, 2022

I threw together a minimal example of something that shows similar behavior:

// Assumes the usual Ginkgo/Gomega dot-imports and
// k8s.io/apimachinery/pkg/util/wait and time imports.
var _ = Describe("", func() {
	It("", func() {
		bools := []bool{
			false,
			true,
		}
		i := 0

		err := wait.PollImmediate(1*time.Millisecond, 2*time.Millisecond, func() (bool, error) {
			b := bools[i]
			i++
			return Expect(b).To(BeTrue()), nil
		})
		Expect(err).NotTo(HaveOccurred())
	})
})

I have a feeling the root of the problem is that if the Expects inside the polled function fail even once within the timeout, they panic and stop execution even when we'd want the poll to try again. So if the poll function succeeds right away it looks normal, but if it fails even once before then, it panics instead of retrying. That might explain the flakiness.

return Expect(result.StatusCode).To(Equal(503)) &&
// upstream_rq_retry: Total request retries
Expect(metrics["upstream_rq_retry"]).To(Equal("0")) &&
// upstream_rq_retry_limit_exceeded: Total requests not retried because max retries reached
Expect(metrics["upstream_rq_retry_limit_exceeded"]).To(Equal("0")) &&
// upstream_rq_retry_backoff_exponential: Total retries using the exponential backoff strategy
Expect(metrics["upstream_rq_retry_backoff_exponential"]).To(Equal("0")), nil

Instead of using the return value from Expect, I'd use the values directly. In my example, that would mean inside the poll function return b, nil instead of return Expect(b).To(BeTrue()), nil. We'd need some extra logging in the actual test to see exactly which condition is failing, but the Expect(err).NotTo(HaveOccurred()) after the polling is done should ultimately catch anything that goes wrong.

Then if we're not calling Expect inside the poll function we don't need to worry about GinkgoRecover.

@keithmattix
Contributor Author

So, the flakiness actually seems to be caused by incomplete cleanup between tests. An unwanted MRC manifest left behind in a different namespace by a different test resulted in two different certificates being used; thus, communication between services (and proxy->control plane traffic) failed. This uncovered a bug in how we were handling the informers: there was no namespace selection.

To your point about the panics though @nojnhuh, it doesn't appear we can return all the values we need to (take the retry e2e test for example) due to the signature of the wait.Poll function (the execution function must return a bool + error). I think my latest set of changes will stabilize the behavior we've been seeing, though.
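One way to work within that bool + error signature is to compare the metric values directly and record which check failed in a variable the surrounding test can log later. This is a hedged sketch, not the PR's actual code: only the metric names and expected "0" values come from the snippet above; `checkRetryMetrics` and its shape are illustrative.

```go
package main

import "fmt"

// checkRetryMetrics mirrors the shape of the retry e2e assertions: since a
// wait.Poll callback can only return (bool, error), we compare values
// directly instead of calling Gomega Expects (which panic mid-poll), and
// return a description of the first failing check for later logging.
func checkRetryMetrics(metrics map[string]string) (bool, string) {
	checks := []struct {
		name, want string
	}{
		{"upstream_rq_retry", "0"},                     // total request retries
		{"upstream_rq_retry_limit_exceeded", "0"},      // requests not retried: max retries reached
		{"upstream_rq_retry_backoff_exponential", "0"}, // retries using exponential backoff
	}
	for _, c := range checks {
		if got := metrics[c.name]; got != c.want {
			return false, fmt.Sprintf("%s = %q, want %q", c.name, got, c.want)
		}
	}
	return true, ""
}

func main() {
	ok, why := checkRetryMetrics(map[string]string{
		"upstream_rq_retry":                     "0",
		"upstream_rq_retry_limit_exceeded":      "1",
		"upstream_rq_retry_backoff_exponential": "0",
	})
	// prints: false upstream_rq_retry_limit_exceeded = "1", want "0"
	fmt.Println(ok, why)
}
```

The poll callback would then `return ok, nil` and the test could include `why` in its failure message once polling gives up.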

Signed-off-by: Keith Mattix II <[email protected]>
m.signingIssuer = c
m.mu.Unlock()
case mrc.Status.State == constants.MRCStateValidatingRollback || mrc.Status.State == constants.MRCStateValidatingRollout:
Contributor

I believe that if an MRC is in the IssuingRollback state, it should only be used for validating. I define the IssuingRollback state as the rollback of using this issuer to sign certificates.

Suggested change
case mrc.Status.State == constants.MRCStateValidatingRollback || mrc.Status.State == constants.MRCStateValidatingRollout:
case mrc.Status.State == constants.MRCStateIssuingRollback || mrc.Status.State == constants.MRCStateValidatingRollout:

Contributor Author

Ah, I just got a chance to look at the state-machine logic here. Isn't there a scenario where there's no signing issuer on startup? If you get an MRC with status IssuingRollback and an MRC with status ValidatingRollout, there's no signing issuer. At least for startup purposes, IssuingRollback seems to imply that the MRC in question was the de-facto issuer

Contributor

The IssuingRollback state is set when an MRC enters the issuingRollout state. I don't think there should ever be a case where MRCs have the IssuingRollback and ValidatingRollout states.

Contributor

This comment is not blocking and can be addressed in a follow up PR.

Contributor Author

Hmm, so what does it mean if one MRC is active and the other is ValidatingRollout? Keep in mind that the control plane could restart between status updates of two different MRCs.
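The question under discussion boils down to which MRC states make an issuer usable for signing versus only for validating at startup. This is a speculative sketch of one possible mapping, reflecting the reading that IssuingRollback implies the MRC was the de-facto issuer; the constant names mirror the diff above, but the mapping itself is an assumption, not OSM's confirmed behavior.

```go
package main

import "fmt"

// Hypothetical string constants mirroring the MRC states in the diff above.
const (
	MRCStateActive             = "active"
	MRCStateIssuingRollout     = "issuingRollout"
	MRCStateIssuingRollback    = "issuingRollback"
	MRCStateValidatingRollout  = "validatingRollout"
	MRCStateValidatingRollback = "validatingRollback"
)

// rolesFor sketches the startup decision under debate: which states let an
// MRC's issuer sign certificates, and which only let it validate them.
// The IssuingRollback row follows keithmattix's reading; jaellio's reading
// would move it to the validate-only row.
func rolesFor(state string) (signing, validating bool) {
	switch state {
	case MRCStateActive, MRCStateIssuingRollout, MRCStateIssuingRollback:
		return true, true
	case MRCStateValidatingRollout, MRCStateValidatingRollback:
		return false, true
	}
	return false, false
}

func main() {
	s, v := rolesFor(MRCStateIssuingRollback)
	fmt.Println(s, v) // prints: true true
}
```

Writing the mapping out this way makes the startup gap concrete: if no MRC lands in the signing rows, the control plane comes up with no signing issuer.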

Signed-off-by: Keith Mattix II <[email protected]>
@jaellio
Contributor

jaellio commented Jun 28, 2022

@keithmattix Could you please update the PR description and include any testing you have done?

@keithmattix keithmattix changed the title Add a GinkgoRecover to start debugging flaky tests Fix flaky e2e tests Jun 28, 2022
@keithmattix keithmattix requested a review from nojnhuh June 28, 2022 18:52
@nojnhuh nojnhuh merged commit 4a3d57d into openservicemesh:main Jun 28, 2022
@keithmattix keithmattix deleted the fix-flaky-test branch June 28, 2022 19:13
Development

Successfully merging this pull request may close these issues.

e2e_hashivault_test.go occasionally fails in CI
4 participants