continue workflows on restarts #10294


Merged: jrhizor merged 27 commits into master from jrhizor/continue-workflows-on-restart on Feb 17, 2022

Conversation

jrhizor (Contributor) commented Feb 12, 2022

Please first take a look at the test cases to confirm that they're testing the correct behavior, then look at the settings changes for the sync workflow, and finally at the handling for a deploy that happens while files are being copied onto the pod.

The timing change (30s -> 1s) will be split into a separate PR so it can be made configurable via an environment variable.

github-actions bot added the area/platform, area/worker, and kubernetes labels on Feb 12, 2022
jrhizor (Contributor Author) commented Feb 15, 2022

[Screenshot: Screen Shot 2022-02-15 at 9:37:24 AM]

exciting times

jrhizor changed the title from "wip continue workflows on restarts" to "continue workflows on restarts" on Feb 15, 2022
@@ -129,8 +129,7 @@ public void run(final ConnectionUpdaterInput connectionUpdaterInput) throws Retr
         ChildWorkflowOptions.newBuilder()
             .setWorkflowId("sync_" + maybeJobId.get())
             .setTaskQueue(TemporalJobType.CONNECTION_UPDATER.name())
-            // This will cancel the child workflow when the parent is terminated
-            .setParentClosePolicy(ParentClosePolicy.PARENT_CLOSE_POLICY_TERMINATE)
+            .setParentClosePolicy(ParentClosePolicy.PARENT_CLOSE_POLICY_REQUEST_CANCEL)
Contributor:

why?

jrhizor (Contributor Author):

Added some clarifying comments.

Contributor:

is the idea that sending the cancel signal means that we can retry the launcher activity? terminate means it cannot be retried.

cancelling the launcher activity does not cancel its children, right? because when we resume we are reconnecting to the children. it just kills the launcher activity so it can be retried subsequently.

jrhizor (Contributor Author):

We do want cancellation to cancel its children. Otherwise, cancellation would not be able to stop a sync in progress. The current LauncherWorker implementation kills everything for the connection when cancelled. This flag is necessary for this behavior; otherwise the cancellation doesn't propagate and the async kubernetes process is orphaned forever.
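
To illustrate (a minimal sketch, not the actual LauncherWorker code; the workflow, activity, and method names here are hypothetical), the point of REQUEST_CANCEL is that the child sees the cancellation and still gets a chance to clean up its kubernetes pods instead of leaving them orphaned:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.failure.CanceledFailure;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@ActivityInterface
interface LauncherActivity {
  void runSync(String connectionId);
  void killRunningPodsForConnection(String connectionId);
}

@WorkflowInterface
interface SyncWorkflow {
  @WorkflowMethod
  void run(String connectionId);
}

class SyncWorkflowImpl implements SyncWorkflow {

  private final LauncherActivity launcher = Workflow.newActivityStub(
      LauncherActivity.class,
      ActivityOptions.newBuilder().setStartToCloseTimeout(Duration.ofDays(3)).build());

  @Override
  public void run(final String connectionId) {
    try {
      // launches and waits on the async kubernetes process
      launcher.runSync(connectionId);
    } catch (final CanceledFailure e) {
      // with PARENT_CLOSE_POLICY_REQUEST_CANCEL the cancellation reaches this workflow
      // (TERMINATE would not), so cleanup can still run; it has to happen in a detached
      // scope because the workflow itself is being cancelled
      Workflow.newDetachedCancellationScope(
          () -> launcher.killRunningPodsForConnection(connectionId)).run();
      throw e;
    }
  }
}
```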

jrhizor (Contributor Author):

I'll add info from our last two messages in the comments, but the behavior is correct imo.

Contributor:

cool. based on added info. i agree.

@@ -24,12 +26,20 @@
   private static final int MAX_SYNC_TIMEOUT_DAYS = configs.getSyncJobMaxTimeoutDays();
   private static final Duration DB_INTERACTION_TIMEOUT = Duration.ofSeconds(configs.getMaxActivityTimeoutSecond());
 
+  // retry infinitely if the worker is killed without exceptions and dies due to timeouts
Contributor:

what are examples of when this happens? do we risk getting stuck forever?

jrhizor (Contributor Author):

We rethrow WorkerExceptions as RuntimeExceptions, so anything we do and catch ourselves should not be retried by this policy. Anything heartbeat related, however, should retry, which is what this allows.

Specifically we want to retry timeouts. The only risk of getting stuck forever would be if there was some way to get stuck in a heartbeat failure state consistently forever, which should not be possible. @benmoriceau any thoughts on this?
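
For concreteness, this is roughly the shape of activity options that retry indefinitely on heartbeat/timeout failures while refusing to retry the RuntimeExceptions we throw ourselves (a sketch with assumed names and timeout values, not the actual constant):

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import java.time.Duration;

final class LauncherActivityOptionsSketch {
  // assumed value standing in for the real configs-driven constant
  private static final int MAX_SYNC_TIMEOUT_DAYS = 3;

  static final ActivityOptions OPTIONS = ActivityOptions.newBuilder()
      .setScheduleToCloseTimeout(Duration.ofDays(MAX_SYNC_TIMEOUT_DAYS))
      // a missed heartbeat (e.g. the worker process was killed) fails the attempt...
      .setHeartbeatTimeout(Duration.ofSeconds(30))
      .setRetryOptions(RetryOptions.newBuilder()
          // ...and 0 means unlimited attempts, so timeout failures keep retrying
          .setMaximumAttempts(0)
          // exceptions we catch and rethrow as RuntimeException are not retried
          .setDoNotRetry(RuntimeException.class.getName())
          .build())
      .build();
}
```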

Contributor:

okay. can you adjust the comment to explain this?

Contributor:

can you explain why it is not possible to get stuck in a heartbeat failure state forever?

jrhizor (Contributor Author):

Talked with Benoit about this in person. The activity has the heartbeat timeout configured, so it's not possible to be stuck waiting for a heartbeat. However, it's worth explicitly adding a test to make sure that all other exceptions we may throw from the TemporalUtils wrapper call are properly handled by this setting. I'm going to add a couple of unit tests to make sure we don't violate that expectation for this constant.

jrhizor (Contributor Author) commented Feb 17, 2022:

73b4668 has the tests for this. @benmoriceau and @cgardens is this sufficient? I did have to adjust the type of exception.

fileMap,
portMap);
log.info("Creating " + podName + " for attempt number: " + jobRunConfig.getAttemptId());
killRunningPodsForConnection(podName);
Contributor:

my understanding from our conversation about the state machine yesterday was that we

  • determine what pod we want to run
  • kill any running pod (for the connection) that doesn't match it
  • then, if no running pod matches the pod we want, start one

my understanding from this code change is that the killing now only happens inside this conditional, so in the case where something is running that doesn't match what we are looking for, it won't get killed? is that what we want? am i misunderstanding?

jrhizor (Contributor Author):

It's more:

  • Determine what pod we want to run
  • If we haven't successfully initialized this pod in the past, kill any pod for that connection and create it. If we can't create, fail the attempt.
  • Then attach to the pod (whether or not it was initialized this run)

Since we always run the deletion when creating, it eliminates the risk of an error in our filtering down to a specific podName; it just axes everything for that connection id. A rough sketch of this flow is below.

Do you think that's sufficiently close to the state machine diagram, or should we change it on one side?
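
Here's a compressed sketch of that create-or-attach flow (helper names are placeholders that only stand in for the real LauncherWorker methods):

```java
// a simplified sketch of the flow described above, not the actual implementation
class PodLifecycleSketch {

  void createOrAttach(final String podName, final String connectionId) {
    if (!podWasInitializedPreviously(podName)) {
      // always wipe everything for the connection before creating, rather than
      // filtering to a specific podName, so a bad filter can't strand old pods
      killRunningPodsForConnection(connectionId);
      try {
        createPod(podName);
      } catch (final Exception e) {
        failAttempt(e);
        return;
      }
    }
    // attach whether the pod was created in this run or a previous one
    attachToPod(podName);
  }

  // placeholder helpers so the sketch compiles
  private boolean podWasInitializedPreviously(final String podName) { return false; }
  private void killRunningPodsForConnection(final String connectionId) {}
  private void createPod(final String podName) {}
  private void failAttempt(final Exception e) {}
  private void attachToPod(final String podName) {}
}
```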

Contributor:

yup. i'm convinced.

…ner_orchestrator/ContainerOrchestratorApp.java

Co-authored-by: Charles <[email protected]>
jrhizor (Contributor Author) commented Feb 17, 2022

It looks like NO_RETRY actually does pass the tests; heartbeats appear to be handled completely out of band. I'll still keep the tests I have, but I think we can just switch to the other flag.
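
For reference, a no-retry policy in the Temporal Java SDK is just a single-attempt RetryOptions; whether the NO_RETRY constant here is defined exactly this way is an assumption:

```java
import io.temporal.common.RetryOptions;

final class RetryPolicies {
  // one attempt, no retries (assumed shape of the NO_RETRY constant)
  static final RetryOptions NO_RETRY = RetryOptions.newBuilder()
      .setMaximumAttempts(1)
      .build();
}
```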

jrhizor merged commit a66d8be into master on Feb 17, 2022
jrhizor deleted the jrhizor/continue-workflows-on-restart branch on February 17, 2022
Labels: area/platform, area/worker
3 participants