-
Notifications
You must be signed in to change notification settings - Fork 2k
Fix DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode that is failing with reproducible seed #17389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…is failing with reproducible seed Signed-off-by: Craig Perkins <[email protected]>
❌ Gradle check result for de2f7a2: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #17389 +/- ##
============================================
- Coverage 72.53% 72.47% -0.06%
+ Complexity 65814 65722 -92
============================================
Files 5311 5311
Lines 304945 304948 +3
Branches 44226 44226
============================================
- Hits 221193 221017 -176
- Misses 65619 65815 +196
+ Partials 18133 18116 -17 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
One thing I've noticed is the the shardCount is wrong for this seed.
Status shows the following: When the test is successful it returns: |
… proper cleanup Signed-off-by: Craig Perkins <[email protected]>
❌ Gradle check result for fe9d1e6: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 96d866c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 96d866c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 96d866c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for d123f8c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 30f2b90: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❕ Gradle check result for 30f2b90: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
@cwperks -- I just got hit by this flaky test and was hoping that your PR will fix it. But, it looks like one of your runs yesterday failed with Do you know if that's another issue unrelated to your fix? Does your fix make the test less flaky? I'm definitely willing to accept less flaky. |
The changes in this PR fix an issue with a reproducible seed which would certainly cut down on flakiness in this test case. I just looked at the failure in #17389 (comment), but I can't reproduce with the same seed from that run when running the test locally. I think the changes in this PR will reduce the flakiness, but looks like test is still susceptible to failing if the snapshot repository that is created in the test isn't cleaned up in time. |
@msfroh any concerns with this flaky test fix? |
I also encountered a similar issue in my PR. |
Thanks @cwperks. Can we please get this merged, coming from the Gradle Check Metrics Dashboard, I can see this test in top 3. |
server/src/main/java/org/opensearch/snapshots/SnapshotShardsService.java
Show resolved
Hide resolved
So the failure in #17389 (comment) was not deterministic since it succeeded on a retry. It looks to me like this PR definitely fixes a bug, but some sort or race condition remains present in the test (probably unrelated to this fix). |
Signed-off-by: Craig Perkins <[email protected]>
One note, picking on a single failure:
every failure I've seen have been the case where this line generates |
For this particular seed, I always receive errors like this without the changes in this PR: Full Error
I had to take a pretty deep dive for this particular fix and my first attempt at a fix was this commit: de2f7a2 During the deep dive I noticed it occurred in scenarios where the Shard Snapshot was aborted and subsequently the generation was null.
I have not checked this specifically. I am not as familiar with snapshots as other part of the code base and wanted to take this flaky test on to dive further into this area. |
@cwperks Right, the failures always have something like |
❌ Gradle check result for 5af80bd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
@cwperks Unfortunately I can get this PR to fail reliably with the following seed:
I've got a work-around PR here #17996 that does seem to resolve the flakiness (but not fix the underlying bug if there is one). |
Thank you for taking a look @andrross ! What is the error message you are receiving? The test is passing with that seed on my mac with this change. I'm not an expert in this area of the code base so I defer to the experts. I've hit this flaky tests many times and decided to take a stab at it a month ago. |
❌ Gradle check result for 5af80bd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
The error is below. Weirdly it seems to only reproduce on my Linux machine. It passes for me as well when running on my Mac.
|
I was able to see the following fail constantly.
|
I can get these to fail as well. They do pass with the change in #17996. I vote for merging #17996 and separately investigating whether there is a bug related to the other path types. |
Thank you @andrross! Closing this PR in favor of the one you opened up. |
Description
Fails consistently with:
This PR ensures that the generation in the ShardSnapshotStatus is not null in the case where the snapshot is aborted. Essentially the repository cleanup fails to cleanup files where a snapshot was aborted.
Related Issues
Resolves #15806
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.