Fix flaky tests #19459

Merged

merged 4 commits on Nov 16, 2022

Conversation

gosusnp (Contributor) commented Nov 16, 2022

What

Tests involving resets can be flaky. This is because we currently track job state in two different ways: one in the jobs table, the other in the WorkflowState from the connection manager. Following a reset, we update part of the WorkflowState, update the jobs DB, clear resets from the reset table, and then update the state.

One option would be to rewrite the state logic so that the DB, rather than a split between the two, is the workers' main source of truth. This issue is likely to surface in tests; under real conditions, scheduling and timing make it much less likely to happen.

This PR introduces a workaround that makes it possible to wait on both states. The other solution should be discussed and prioritized later.

Closes #19398

How

  • Expose WorkflowState.running in the JobsDebugInfo API.
  • Add a condition to the tests involving a reset to wait for the WorkflowState.running flag to flip as well as for the job status (see the sketch below).
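
For illustration, here is a minimal sketch of that wait. It assumes the generated API client exposes getJobDebugInfo on JobsApi and that JobDebugInfoRead now carries a workflowState with a running flag; the helper name, model accessors, and package names are assumptions, not the exact acceptance-test code.

import java.time.Duration;
import java.time.Instant;
// Generated Airbyte API client types; exact packages assumed.
import io.airbyte.api.client.generated.JobsApi;
import io.airbyte.api.client.model.generated.JobDebugInfoRead;
import io.airbyte.api.client.model.generated.JobIdRequestBody;
import io.airbyte.api.client.model.generated.JobRead;
import io.airbyte.api.client.model.generated.JobStatus;

class ResetWaitSketch {

  // Waits until the job is no longer RUNNING in the jobs table *and* the
  // temporal-backed WorkflowState.running flag has flipped to false.
  static void waitForJobAndWorkflowToStop(final JobsApi jobsApi, final JobRead job, final Duration timeout)
      throws Exception {
    final Instant deadline = Instant.now().plus(timeout);
    while (Instant.now().isBefore(deadline)) {
      final JobDebugInfoRead debugInfo = jobsApi.getJobDebugInfo(new JobIdRequestBody().id(job.getId()));
      final boolean jobStopped = debugInfo.getJob().getStatus() != JobStatus.RUNNING;
      final boolean workflowStopped =
          debugInfo.getWorkflowState() == null || !debugInfo.getWorkflowState().getRunning();
      if (jobStopped && workflowStopped) {
        return;
      }
      Thread.sleep(1_000);
    }
    throw new IllegalStateException("Job or workflow still running after " + timeout);
  }

}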

@octavia-squidington-iv octavia-squidington-iv added area/api Related to the api area/documentation Improvements or additions to documentation area/platform issues related to the platform area/server labels Nov 16, 2022
@@ -122,7 +143,15 @@ public JobDebugInfoRead getJobDebugInfo(final JobIdRequestBody jobIdRequestBody)
final Job job = jobPersistence.getJob(jobIdRequestBody.getId());
final JobInfoRead jobinfoRead = jobConverter.getJobInfoRead(job);

return buildJobDebugInfoRead(jobinfoRead);
final JobDebugInfoRead jobDebugInfoRead = buildJobDebugInfoRead(jobinfoRead);
if (temporalClient != null) {
Contributor

Do we want to have that check or do the migration in cloud directly?

Contributor Author

My preference is generally to have the backward compatible version rather than force a breaking change.
The clean up commit feels easier compared to dealing with broken changes with a P0 at the same time 😂
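
For context, a rough sketch of the backward-compatible shape being discussed: the workflow state is only attached when a temporal client is configured, so deployments without one keep the previous response. It assumes getWorkflowState returns an Optional; toWorkflowStateRead is a hypothetical converter, not the exact PR code.

final JobDebugInfoRead jobDebugInfoRead = buildJobDebugInfoRead(jobinfoRead);
if (temporalClient != null) {
  // Only reachable when a temporal client is wired in; otherwise the response keeps
  // its previous shape and no migration is forced on existing callers.
  final UUID connectionId = UUID.fromString(job.getScope());
  temporalClient.getWorkflowState(connectionId)
      .ifPresent(state -> jobDebugInfoRead.setWorkflowState(toWorkflowStateRead(state)));
}
return jobDebugInfoRead;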

// still be cleaning up some data in the reset table. This would be an argument for reworking the
// source of truth of the replication workflow state to be in DB rather than in Memory and
// serialized automagically by temporal
waitWhileJobIsRunning(apiClient.getJobsApi(), jobInfoRead.getJob(), Duration.ofMinutes(1));
Contributor

@gosusnp - do we need this line at all? now that you've updated the handler won't waitWhileJobHasStatus (2 lines above) also get the correct source of truth? it seems redundant.

Contributor Author

I believe those tests were flaky until this PR got merged.
Those two checks are different: the first one looks at the DB state, this one looks at the temporal state. Until we fix the actual endpoint (either change how we handle clean-up after a reset or have a single source of truth for state), yes, I think we still need this check.
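
In other words, the test ends up doing two waits against two different sources. A sketch of the shape, with the exact signature of waitWhileJobHasStatus assumed from the existing test helpers:

// 1) DB-backed check: poll the job status as stored in the jobs table until it is no
//    longer RUNNING (signature assumed).
waitWhileJobHasStatus(apiClient.getJobsApi(), jobInfoRead.getJob(), Set.of(JobStatus.RUNNING));

// 2) Temporal-backed check: poll JobsDebugInfo until WorkflowState.running flips to false,
//    since the reset workflow may still be cleaning up after the job row is terminal.
waitWhileJobIsRunning(apiClient.getJobsApi(), jobInfoRead.getJob(), Duration.ofMinutes(1));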

Contributor

ah. i see. they are using different endpoints. can you make sure we have an issue to follow up on this, please? it's pretty spooky.

final JobDebugInfoRead jobDebugInfoRead = buildJobDebugInfoRead(jobinfoRead);
if (temporalClient != null) {
final UUID connectionId = UUID.fromString(job.getScope());
temporalClient.getWorkflowState(connectionId)
Contributor

@gosusnp will this break if the next job has started? basically for connection X:

  • job 1 happens and completes.
  • then job 2 starts.
  • while job 2 is running i look up the debug info for job 1. won't it say running (because this getWorkflowState is keyed only by connection id), even though i would expect it not to be running?

Contributor Author

I think what you described is correct, since workflowState is per connection id and has no history.

The other option that was considered was to have a different API endpoint; at the time, this felt like a smaller lift to remove some noise from the tests.

Contributor

got it. we should at least have a todo or warning in the code then. that's going to catch someone by surprise. i didn't figure it out until i was in the shower a couple hours later. 😅
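
One possible shape for that warning, placed next to the handler change (the wording is a suggestion, not from the PR):

// WARNING: WorkflowState is keyed only by connection id and keeps no history. When this
// endpoint is called for an older job of the same connection, the running flag reflects
// the workflow's current state (e.g. a newer job in progress), not the state at the time
// the requested job ran.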

akashkulk pushed a commit that referenced this pull request Dec 2, 2022
* Expose WorkflowState in JobsDebugInfo

* Make attribute required

* Update the tests

* Protect more tests
Successfully merging this pull request may close these issues.

BasicAcceptanceTests.testIncrementalSync is flaky