Fix flaky tests #19459
Conversation
@@ -122,7 +143,15 @@ public JobDebugInfoRead getJobDebugInfo(final JobIdRequestBody jobIdRequestBody)
      final Job job = jobPersistence.getJob(jobIdRequestBody.getId());
      final JobInfoRead jobinfoRead = jobConverter.getJobInfoRead(job);

-     return buildJobDebugInfoRead(jobinfoRead);
+     final JobDebugInfoRead jobDebugInfoRead = buildJobDebugInfoRead(jobinfoRead);
+     if (temporalClient != null) {
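For context, a sketch of how the rest of this change could plausibly look, built only from the lines shown in this diff; the WorkflowStateRead model, its setter, and the Optional return of getWorkflowState are assumptions rather than the exact merged code:

  public JobDebugInfoRead getJobDebugInfo(final JobIdRequestBody jobIdRequestBody) throws IOException {
    final Job job = jobPersistence.getJob(jobIdRequestBody.getId());
    final JobInfoRead jobinfoRead = jobConverter.getJobInfoRead(job);
    final JobDebugInfoRead jobDebugInfoRead = buildJobDebugInfoRead(jobinfoRead);

    // Only consult Temporal when a client is wired in, so deployments without it
    // (e.g. cloud before its own migration) keep the previous behavior.
    if (temporalClient != null) {
      final UUID connectionId = UUID.fromString(job.getScope());
      temporalClient.getWorkflowState(connectionId)
          // Assumption: getWorkflowState returns an Optional and the generated API model
          // exposes a workflowState field with a boolean `running` flag.
          .ifPresent(workflowState -> jobDebugInfoRead
              .setWorkflowState(new WorkflowStateRead().running(workflowState.isRunning())));
    }
    return jobDebugInfoRead;
  }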
Do we want to have that check or do the migration in cloud directly?
My preference is generally to have the backward-compatible version rather than force a breaking change.
The cleanup commit feels easier than dealing with breaking changes and a P0 at the same time 😂
      // still be cleaning up some data in the reset table. This would be an argument for reworking the
      // source of truth of the replication workflow state to be in DB rather than in Memory and
      // serialized automagically by temporal
      waitWhileJobIsRunning(apiClient.getJobsApi(), jobInfoRead.getJob(), Duration.ofMinutes(1));
@gosusnp - do we need this line at all? now that you've updated the handler, won't waitWhileJobHasStatus (2 lines above) also get the correct source of truth? it seems redundant.
I believe those tests were flaky until this PR got merged.
Those two checks are different: the first one looks at the DB state, this one looks at the Temporal state. Until we fix the actual endpoint (either change how we handle cleanup after a reset, or have a single source of truth for state), yes, I think we still need this check.
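A rough sketch of that sequence in the acceptance test, using the helper names visible in this diff; the wrapper method, the status set, and the AirbyteApiClient type are assumptions for illustration:

  private static void waitForResetToFinish(final AirbyteApiClient apiClient, final JobInfoRead jobInfoRead) throws Exception {
    // 1) Wait on the jobs DB: the job row has to leave the non-terminal statuses.
    waitWhileJobHasStatus(apiClient.getJobsApi(), jobInfoRead.getJob(),
        Set.of(JobStatus.PENDING, JobStatus.RUNNING, JobStatus.INCOMPLETE));
    // 2) Wait on Temporal: the connection manager workflow may still be cleaning up
    //    the reset table even after the DB already reports a terminal status.
    waitWhileJobIsRunning(apiClient.getJobsApi(), jobInfoRead.getJob(), Duration.ofMinutes(1));
  }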
ah. i see. they are using different endpoints. can you make sure we have an issue to follow up on this, please? it's pretty spooky.
      final JobDebugInfoRead jobDebugInfoRead = buildJobDebugInfoRead(jobinfoRead);
      if (temporalClient != null) {
        final UUID connectionId = UUID.fromString(job.getScope());
        temporalClient.getWorkflowState(connectionId)
@gosusnp will this break if the next job has started? basically for connection X:
- job 1 happens and completes.
- then job 2 starts.
- while job 2 is running i look up the debug info for job 1. won't it say running (because this getWorkflowState is keyed only by connection id)? i would expect it to not be running.
I think what you described is correct, since workflowState is per connection id and has no history.
The other option that was considered was to have a different API endpoint; this felt like a smaller lift at the time to remove some noise from the tests.
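To illustrate the caveat being discussed — a hypothetical sketch, not code from this PR — the lookup is keyed only by connection id, so the flag reflects whatever that connection's workflow is doing right now:

  // job1 has already completed; job2 for the same connection is currently running.
  final UUID connectionId = UUID.fromString(job1.getScope());
  final boolean running = temporalClient.getWorkflowState(connectionId)
      .map(WorkflowState::isRunning) // assumption: the state object exposes isRunning()
      .orElse(false);
  // `running` comes back true because of job2, even though the debug info request was for job1,
  // hence the suggestion to leave a TODO / warning until state has a single source of truth.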
got it. we should at least have a todo or warning in the code then. that's going to catch someone by surprise. i didn't figure it out until i was in the shower a couple hours later. 😅
* Expose WorkflowState in JobsDebugInfo
* Make attribute required
* Update the tests
* Protect more tests
What
Tests involving resets can be flaky. This is because we currently track job state in two different ways: one is the jobs table, the other is the WorkflowState from the connection manager. Following a reset, we update part of the WorkflowState, update the jobs DB, clear resets from the reset table, then update the state.
One option would be to rework the state logic so that the main source of truth for workers lives in the DB rather than being split. The issue is likely to show up in the tests but, due to scheduling and timing, is less likely to happen under real conditions.
This PR introduces a workaround that lets the tests wait on both states. The more complete solution should be discussed and prioritized later.
Closes #19398
How
* Expose WorkflowState.running in the JobsDebugInfo API.
* Update the tests to wait for the WorkflowState.running flag to flip as well as the jobs status.
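As a usage sketch, this is roughly how a test or client could consume the new field; the accessor names (getWorkflowState, getRunning) and the fluent request builder are assumptions based on the description above, not the exact generated API model:

  final JobDebugInfoRead debugInfo = apiClient.getJobsApi()
      .getJobDebugInfo(new JobIdRequestBody().id(jobId));

  // A reset only counts as fully finished once the jobs DB status is terminal AND the
  // connection manager workflow reports that it is no longer running.
  final boolean workflowStillRunning = debugInfo.getWorkflowState() != null
      && Boolean.TRUE.equals(debugInfo.getWorkflowState().getRunning());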