continue workflows on restarts #10294


Merged: jrhizor merged 27 commits into master from jrhizor/continue-workflows-on-restart on Feb 17, 2022

Conversation

jrhizor (Contributor) commented Feb 12, 2022

Please first take a look at the test cases to confirm that they're testing the correct behavior, then look at the settings changes for the sync workflow, and finally at the handling for a deploy that happens while files are being copied onto the pod.

The timing change (30s -> 1s) will be split into a separate PR so it can be made configurable via an environment variable.

github-actions bot added the area/platform, area/worker, and kubernetes labels on Feb 12, 2022
jrhizor (Contributor Author) commented Feb 15, 2022

[Screenshot: Screen Shot 2022-02-15 at 9:37:24 AM]

exciting times

jrhizor changed the title from "wip continue workflows on restarts" to "continue workflows on restarts" on Feb 15, 2022
@@ -129,8 +129,7 @@ public void run(final ConnectionUpdaterInput connectionUpdaterInput) throws Retr
         ChildWorkflowOptions.newBuilder()
             .setWorkflowId("sync_" + maybeJobId.get())
             .setTaskQueue(TemporalJobType.CONNECTION_UPDATER.name())
-            // This will cancel the child workflow when the parent is terminated
-            .setParentClosePolicy(ParentClosePolicy.PARENT_CLOSE_POLICY_TERMINATE)
+            .setParentClosePolicy(ParentClosePolicy.PARENT_CLOSE_POLICY_REQUEST_CANCEL)
Contributor:

why?

jrhizor (Contributor Author):

Added some clarifying comments.

Contributor:

is the idea that sending the cancel signal means that we can retry the launcher activity? terminate means it cannot be retried.

cancelling the launcher activity does not cancel its children, right? because when we resume we are reconnecting to the children. it just kills the launcher activity so it can be retried subsequently.

jrhizor (Contributor Author):

We do want cancellation to cancel its children. Otherwise, cancellation would not be able to stop a sync in progress. The current LauncherWorker implementation kills everything for the connection when cancelled. This flag is necessary for this behavior; otherwise the cancellation doesn't propagate and the async kubernetes process is orphaned forever.
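
To illustrate (a minimal sketch, not the actual LauncherWorker code; the workflow, activity, and method names here are hypothetical), the point of REQUEST_CANCEL is that the child sees the cancellation and still gets a chance to clean up its kubernetes pods instead of leaving them orphaned:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.failure.CanceledFailure;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@ActivityInterface
interface LauncherActivity {
  void runSync(String connectionId);
  void killRunningPodsForConnection(String connectionId);
}

@WorkflowInterface
interface SyncWorkflow {
  @WorkflowMethod
  void run(String connectionId);
}

class SyncWorkflowImpl implements SyncWorkflow {

  private final LauncherActivity launcher = Workflow.newActivityStub(
      LauncherActivity.class,
      ActivityOptions.newBuilder().setStartToCloseTimeout(Duration.ofDays(3)).build());

  @Override
  public void run(final String connectionId) {
    try {
      // launches and waits on the async kubernetes process
      launcher.runSync(connectionId);
    } catch (final CanceledFailure e) {
      // with PARENT_CLOSE_POLICY_REQUEST_CANCEL the cancellation reaches this workflow
      // (TERMINATE would not), so cleanup can still run; it has to happen in a detached
      // scope because the workflow itself is being cancelled
      Workflow.newDetachedCancellationScope(
          () -> launcher.killRunningPodsForConnection(connectionId)).run();
      throw e;
    }
  }
}
```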

jrhizor (Contributor Author):

I'll add info from our last two messages in the comments, but the behavior is correct imo.

Contributor:

cool. based on added info. i agree.

@@ -24,12 +26,20 @@
   private static final int MAX_SYNC_TIMEOUT_DAYS = configs.getSyncJobMaxTimeoutDays();
   private static final Duration DB_INTERACTION_TIMEOUT = Duration.ofSeconds(configs.getMaxActivityTimeoutSecond());
 
+  // retry infinitely if the worker is killed without exceptions and dies due to timeouts
Contributor:

what are examples of when this happens? do we risk getting stuck forever?

jrhizor (Contributor Author):

We rethrow WorkerExceptions as RuntimeExceptions, so anything we do and catch ourselves should not be retried by this policy. Anything heartbeat related, however, should retry, which is what this allows.

Specifically we want to retry timeouts. The only risk of getting stuck forever would be if there was some way to get stuck in a heartbeat failure state consistently forever, which should not be possible. @benmoriceau any thoughts on this?
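
For concreteness, this is roughly the shape of activity options that retry indefinitely on heartbeat/timeout failures while refusing to retry the RuntimeExceptions we throw ourselves (a sketch with assumed names and timeout values, not the actual constant):

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import java.time.Duration;

final class LauncherActivityOptionsSketch {
  // assumed value standing in for the real configs-driven constant
  private static final int MAX_SYNC_TIMEOUT_DAYS = 3;

  static final ActivityOptions OPTIONS = ActivityOptions.newBuilder()
      .setScheduleToCloseTimeout(Duration.ofDays(MAX_SYNC_TIMEOUT_DAYS))
      // a missed heartbeat (e.g. the worker process was killed) fails the attempt...
      .setHeartbeatTimeout(Duration.ofSeconds(30))
      .setRetryOptions(RetryOptions.newBuilder()
          // ...and 0 means unlimited attempts, so timeout failures keep retrying
          .setMaximumAttempts(0)
          // exceptions we catch and rethrow as RuntimeException are not retried
          .setDoNotRetry(RuntimeException.class.getName())
          .build())
      .build();
}
```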

Contributor:

okay. can you adjust the comment to explain this?

Contributor:

can you explain why it is not possible to get stuck in a heartbeat failure state forever?

jrhizor (Contributor Author):

Talked with Benoit about this in person. The activity has the heartbeat timeout configured, so it's not possible to be stuck waiting for a heartbeat. However, it's worth explicitly adding a test to make sure that all other exceptions we may throw from the TemporalUtils wrapper call are properly handled by this setting. I'm going to add a couple of unit tests to make sure we don't violate that expectation for this constant.

jrhizor (Contributor Author) commented Feb 17, 2022:

73b4668 has the tests for this. @benmoriceau and @cgardens is this sufficient? I did have to adjust the type of exception.

fileMap,
portMap);
log.info("Creating " + podName + " for attempt number: " + jobRunConfig.getAttemptId());
killRunningPodsForConnection(podName);
Contributor:

my understanding from our conversation about the state machine yesterday was that we

  • determine what pod we want to run
  • kill any running pod (for the connection) that doesn't match it
  • then, if no running pod matches the pod we want, start one

my understanding from this code change is that the killing now only happens inside this conditional, so in the case where something is running that doesn't match what we are looking for, it won't get killed? is that what we want? am i misunderstanding?

jrhizor (Contributor Author):

It's more:

  • Determine what pod we want to run
  • If we haven't successfully initialized this pod in the past, kill any pod for that connection and create it. If we can't create, fail the attempt.
  • Then attach to the pod (whether or not it was initialized this run)

Since we always run the deletion when creating, it eliminates the risk of an error in our filtering down to a specific podName; it just axes everything for that connection id. A rough sketch of this flow is below.

Do you think that's sufficiently close to the state machine diagram, or should we change it on one side?
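
Here's a compressed sketch of that create-or-attach flow (helper names are placeholders that only stand in for the real LauncherWorker methods):

```java
// a simplified sketch of the flow described above, not the actual implementation
class PodLifecycleSketch {

  void createOrAttach(final String podName, final String connectionId) {
    if (!podWasInitializedPreviously(podName)) {
      // always wipe everything for the connection before creating, rather than
      // filtering to a specific podName, so a bad filter can't strand old pods
      killRunningPodsForConnection(connectionId);
      try {
        createPod(podName);
      } catch (final Exception e) {
        failAttempt(e);
        return;
      }
    }
    // attach whether the pod was created in this run or a previous one
    attachToPod(podName);
  }

  // placeholder helpers so the sketch compiles
  private boolean podWasInitializedPreviously(final String podName) { return false; }
  private void killRunningPodsForConnection(final String connectionId) {}
  private void createPod(final String podName) {}
  private void failAttempt(final Exception e) {}
  private void attachToPod(final String podName) {}
}
```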

Contributor:

yup. i'm convinced.

…ner_orchestrator/ContainerOrchestratorApp.java

Co-authored-by: Charles <[email protected]>
jrhizor (Contributor Author) commented Feb 17, 2022

It looks like NO_RETRY actually does pass the tests; heartbeats appear to be handled completely out of band. I'll still keep the tests I have, but I think we can just switch to the other flag.
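
For reference, a no-retry policy in the Temporal Java SDK is just a single-attempt RetryOptions; whether the NO_RETRY constant here is defined exactly this way is an assumption:

```java
import io.temporal.common.RetryOptions;

final class RetryPolicies {
  // one attempt, no retries (assumed shape of the NO_RETRY constant)
  static final RetryOptions NO_RETRY = RetryOptions.newBuilder()
      .setMaximumAttempts(1)
      .build();
}
```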

jrhizor merged commit a66d8be into master on Feb 17, 2022
jrhizor deleted the jrhizor/continue-workflows-on-restart branch on February 17, 2022
Labels: area/platform, area/worker
3 participants