-
Notifications
You must be signed in to change notification settings - Fork 276
Conversation
Codecov Report
@@ Coverage Diff @@
## main #4844 +/- ##
==========================================
- Coverage 69.52% 69.33% -0.19%
==========================================
Files 219 219
Lines 16002 16012 +10
==========================================
- Hits 11125 11102 -23
- Misses 4823 4856 +33
Partials 54 54
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Signed-off-by: Keith Mattix II <[email protected]>
Signed-off-by: Keith Mattix II <[email protected]>
Signed-off-by: Keith Mattix II <[email protected]>
Signed-off-by: Keith Mattix II <[email protected]>
Signed-off-by: Keith Mattix II <[email protected]>
Signed-off-by: Keith Mattix II <[email protected]>
… namespaces Signed-off-by: Keith Mattix II <[email protected]>
e61eb4d
to
8a147fc
Compare
I threw together a minimal example of something that shows similar behavior: var _ = Describe("", func() {
It("", func() {
bools := []bool{
false,
true,
}
i := 0
err := wait.PollImmediate(1*time.Millisecond, 2*time.Millisecond, func() (bool, error) {
b := bools[i]
i++
return Expect(b).To(BeTrue()), nil
})
Expect(err).NotTo(HaveOccurred())
})
}) I have a feeling the root of the problem is that even if the osm/tests/e2e/e2e_retry_policy_test.go Lines 255 to 261 in 8a147fc
Instead of using the return value from Then if we're not calling |
Signed-off-by: Keith Mattix II <[email protected]>
Signed-off-by: Keith Mattix II <[email protected]>
So, the flakiness actual seems to be caused by incomplete cleanup between tests. There's an unwanted MRC manifest that's left behind in a different namespace from a different test that resulted in two different certificates being used; thus, communication between services (and proxy->control plane traffic) failed. This uncovered a bug in how we were handling the informers; there was no namespace selection. To your point about the panics though @nojnhuh, it doesn't appear we can return all the values we need to (take the retry e2e test for example) due to the signature of the wait.Poll function (the execution function must return a bool + error). I think my latest set of changes will stabilize the behavior we've been seeing though |
Signed-off-by: Keith Mattix II <[email protected]>
Signed-off-by: Keith Mattix II <[email protected]>
m.signingIssuer = c | ||
m.mu.Unlock() | ||
case mrc.Status.State == constants.MRCStateValidatingRollback || mrc.Status.State == constants.MRCStateValidatingRollout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe if a MRC is in a state of IssuingRollback it should only be used for validating. I define the IssuingRollback state as the rollback of using this issuer to sign certificates.
case mrc.Status.State == constants.MRCStateValidatingRollback || mrc.Status.State == constants.MRCStateValidatingRollout: | |
case mrc.Status.State == constants.MRCStateIssuingRollback || mrc.Status.State == constants.MRCStateValidatingRollout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I just got a chance to look at the state-machine logic here. Isn't there a scenario where there's no signing issuer on startup? If you get an MRC with status IssuingRollback and an MRC with status ValidatingRollout, there's no signing issuer. At least for startup purposes, IssuingRollback seems to imply that the MRC in question was the de-facto issuer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IssuingRollback state is set when an MRC enters the issuingRollout state. I don't think there should ever be a case where MRCs have the IssuingRollback and ValidatingRollout states.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not blocking and can be addressed in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm so what does it mean of one MRC is active, and the other is ValidatingRollout? Keep in mind that the control plane could restart between status updates of two different MRCs
Signed-off-by: Keith Mattix II <[email protected]>
Signed-off-by: Keith Mattix II <[email protected]>
@keithmattix Could you please update the PR description and include any testing you have done? |
Signed-off-by: Keith Mattix II [email protected]
Description:
Fixes flaky e2e test behavior observed in CI
Fixes #4843
Testing done:
Local e2e testing and multiple CI runs to validate that flakiness has ceased
Affected area:
Please answer the following questions with yes/no.
Does this change contain code from or inspired by another project?
Is this a breaking change?
Has documentation corresponding to this change been updated in the osm-docs repo (if applicable)?