Fix flaky tests #19459

Merged

merged 4 commits on Nov 16, 2022

Conversation

gosusnp (Contributor) commented Nov 16, 2022

What

Tests involving resets can be flaky. This is because we currently track job state in two different ways: one in the jobs table, the other in the WorkflowState from the connection manager. Following a reset, we update part of the WorkflowState, update the jobs DB, clear resets from the reset table, and then update the state.

One option would be to rewrite the state logic so that the DB, rather than a split between the two, is the workers' main source of truth. This issue is likely to surface in tests; under real conditions, scheduling and timing make it much less likely to happen.

This PR introduces a workaround that makes it possible to wait on both states. The other solution should be discussed and prioritized later.

Closes #19398

How

  • Expose WorkflowState.running in the JobsDebugInfo API.
  • Add a condition to the tests involving a reset to wait for the WorkflowState.running flag to flip as well as for the job status (see the sketch below).
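
For illustration, here is a minimal sketch of that wait. It assumes the generated API client exposes getJobDebugInfo on JobsApi and that JobDebugInfoRead now carries a workflowState with a running flag; the helper name, model accessors, and package names are assumptions, not the exact acceptance-test code.

import java.time.Duration;
import java.time.Instant;
// Generated Airbyte API client types; exact packages assumed.
import io.airbyte.api.client.generated.JobsApi;
import io.airbyte.api.client.model.generated.JobDebugInfoRead;
import io.airbyte.api.client.model.generated.JobIdRequestBody;
import io.airbyte.api.client.model.generated.JobRead;
import io.airbyte.api.client.model.generated.JobStatus;

class ResetWaitSketch {

  // Waits until the job is no longer RUNNING in the jobs table *and* the
  // temporal-backed WorkflowState.running flag has flipped to false.
  static void waitForJobAndWorkflowToStop(final JobsApi jobsApi, final JobRead job, final Duration timeout)
      throws Exception {
    final Instant deadline = Instant.now().plus(timeout);
    while (Instant.now().isBefore(deadline)) {
      final JobDebugInfoRead debugInfo = jobsApi.getJobDebugInfo(new JobIdRequestBody().id(job.getId()));
      final boolean jobStopped = debugInfo.getJob().getStatus() != JobStatus.RUNNING;
      final boolean workflowStopped =
          debugInfo.getWorkflowState() == null || !debugInfo.getWorkflowState().getRunning();
      if (jobStopped && workflowStopped) {
        return;
      }
      Thread.sleep(1_000);
    }
    throw new IllegalStateException("Job or workflow still running after " + timeout);
  }

}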

@octavia-squidington-iv octavia-squidington-iv added area/api Related to the api area/documentation Improvements or additions to documentation area/platform issues related to the platform area/server labels Nov 16, 2022
@@ -122,7 +143,15 @@ public JobDebugInfoRead getJobDebugInfo(final JobIdRequestBody jobIdRequestBody)
final Job job = jobPersistence.getJob(jobIdRequestBody.getId());
final JobInfoRead jobinfoRead = jobConverter.getJobInfoRead(job);

return buildJobDebugInfoRead(jobinfoRead);
final JobDebugInfoRead jobDebugInfoRead = buildJobDebugInfoRead(jobinfoRead);
if (temporalClient != null) {
Contributor

Do we want to have that check or do the migration in cloud directly?

Contributor Author

My preference is generally to have the backward compatible version rather than force a breaking change.
The clean up commit feels easier compared to dealing with broken changes with a P0 at the same time 😂
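
For context, a rough sketch of the backward-compatible shape being discussed: the workflow state is only attached when a temporal client is configured, so deployments without one keep the previous response. It assumes getWorkflowState returns an Optional; toWorkflowStateRead is a hypothetical converter, not the exact PR code.

final JobDebugInfoRead jobDebugInfoRead = buildJobDebugInfoRead(jobinfoRead);
if (temporalClient != null) {
  // Only reachable when a temporal client is wired in; otherwise the response keeps
  // its previous shape and no migration is forced on existing callers.
  final UUID connectionId = UUID.fromString(job.getScope());
  temporalClient.getWorkflowState(connectionId)
      .ifPresent(state -> jobDebugInfoRead.setWorkflowState(toWorkflowStateRead(state)));
}
return jobDebugInfoRead;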

// still be cleaning up some data in the reset table. This would be an argument for reworking the
// source of truth of the replication workflow state to be in DB rather than in Memory and
// serialized automagically by temporal
waitWhileJobIsRunning(apiClient.getJobsApi(), jobInfoRead.getJob(), Duration.ofMinutes(1));
Contributor

@gosusnp - do we need this line at all? now that you've updated the handler won't waitWhileJobHasStatus (2 lines above) also get the correct source of truth? it seems redundant.

Contributor Author

I believe those tests were flaky until this PR got merged.
Those two checks are different: the first one looks at the DB state, this one looks at the temporal state. Until we fix the actual endpoint (either change how we handle clean-up after a reset or have a single source of truth for state), yes, I think we still need this check.
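
In other words, the test ends up doing two waits against two different sources. A sketch of the shape, with the exact signature of waitWhileJobHasStatus assumed from the existing test helpers:

// 1) DB-backed check: poll the job status as stored in the jobs table until it is no
//    longer RUNNING (signature assumed).
waitWhileJobHasStatus(apiClient.getJobsApi(), jobInfoRead.getJob(), Set.of(JobStatus.RUNNING));

// 2) Temporal-backed check: poll JobsDebugInfo until WorkflowState.running flips to false,
//    since the reset workflow may still be cleaning up after the job row is terminal.
waitWhileJobIsRunning(apiClient.getJobsApi(), jobInfoRead.getJob(), Duration.ofMinutes(1));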

Contributor

ah. i see. they are using different endpoints. can you make sure we have an issue to follow up on this, please? it's pretty spooky.

final JobDebugInfoRead jobDebugInfoRead = buildJobDebugInfoRead(jobinfoRead);
if (temporalClient != null) {
final UUID connectionId = UUID.fromString(job.getScope());
temporalClient.getWorkflowState(connectionId)
Contributor

@gosusnp will this break if the next job has started? basically for connection X:

  • job 1 happens and completes.
  • then job 2 starts.
  • while job 2 is running i look up the debug info for job 1. won't it say running (because this getWorkflowState is keyed only by connection id), even though i would expect it not to be running?

Contributor Author

I think what you described is correct, since workflowState is per connection id and has no history.

The other option that was considered was to have a different API endpoint; at the time, this felt like a smaller lift to remove some noise from the tests.

Contributor

got it. we should at least have a todo or warning in the code then. that's going to catch someone by surprise. i didn't figure it out until i was in the shower a couple hours later. 😅
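
One possible shape for that warning, placed next to the handler change (the wording is a suggestion, not from the PR):

// WARNING: WorkflowState is keyed only by connection id and keeps no history. When this
// endpoint is called for an older job of the same connection, the running flag reflects
// the workflow's current state (e.g. a newer job in progress), not the state at the time
// the requested job ran.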

akashkulk pushed a commit that referenced this pull request Dec 2, 2022
* Expose WorkflowState in JobsDebugInfo

* Make attribute required

* Update the tests

* Protect more tests
Successfully merging this pull request may close these issues.

BasicAcceptanceTests.testIncrementalSync is flaky