Skip to content

If an activity is failing, stuck the workflow and make it queriable #10121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Feb 15, 2022

Conversation

benmoriceau
Copy link
Contributor

What

After an activity failure, we are blocking the workflow. A new query method is available to query the workflows to get the list of workflow being stuck.
Then the activity can be retry with a signal.

@github-actions github-actions bot added area/platform issues related to the platform area/worker Related to worker labels Feb 5, 2022
@benmoriceau benmoriceau temporarily deployed to more-secrets February 8, 2022 21:59 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 8, 2022 23:04 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 8, 2022 23:34 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 8, 2022 23:52 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 8, 2022 23:59 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 9, 2022 00:03 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 9, 2022 00:14 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 9, 2022 00:24 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 10, 2022 23:03 Inactive
Copy link
Contributor

@cgardens cgardens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a way to reset the job to not stuck / not quarantined as well?

* activity.
*/
@SignalMethod
void retryFailActivity();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retryFailedActivity ?

@@ -77,4 +91,22 @@
@QueryMethod
JobInformation getJobInformation();

@Data
@NoArgsConstructor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a no args constructor?

Comment on lines 383 to 397
private <INPUT> void runActivity(Consumer<INPUT> consumer, INPUT input) {
try {
consumer.accept(input);
} catch (Exception e) {
log.error("Failed to run an activity for the connection " + connectionId, e);
workflowState.setStuck(true);
workflowState.setRetryFailedActivity(false);
Workflow.await(() -> workflowState.isRetryFailedActivity());
log.error("Retrying an activity for the connection " + connectionId, e);
workflowState.setStuck(false);
workflowState.setRetryFailedActivity(false);
runActivity(consumer, input);
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we dry this up. the duplicate code here is going to be easy to have go out of sync.

Suggested change
private <INPUT> void runActivity(Consumer<INPUT> consumer, INPUT input) {
try {
consumer.accept(input);
} catch (Exception e) {
log.error("Failed to run an activity for the connection " + connectionId, e);
workflowState.setStuck(true);
workflowState.setRetryFailedActivity(false);
Workflow.await(() -> workflowState.isRetryFailedActivity());
log.error("Retrying an activity for the connection " + connectionId, e);
workflowState.setStuck(false);
workflowState.setRetryFailedActivity(false);
runActivity(consumer, input);
}
}
private <INPUT> void runActivity2(final Consumer<INPUT> consumer, final INPUT input) {
// make the consumer look like a function so that we can reuse logic in runActivityWithOutput
runActivityWithOutput((inputInternal) -> {
consumer.accept(inputInternal);
return null;
}, input);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -334,4 +365,34 @@ private void continueAsNew(final ConnectionUpdaterInput connectionUpdaterInput)
}
}

private <INPUT, OUTPUT> OUTPUT runActivityWithOutput(Function<INPUT, OUTPUT> mapper, INPUT input) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is only for a subset of activities. should the method name say short activities or mandatory activities or something? also can we add a javadoc comment to this method explaining why it is necessary, please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Data
@NoArgsConstructor
@AllArgsConstructor
class StuckInformation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we replace the word "stuck" with "quarantined", please? quarantined is a common state name for a workflow that needs manual intervention. within that we can add information on the reason it is stuck.

@benmoriceau benmoriceau temporarily deployed to more-secrets February 15, 2022 19:26 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 15, 2022 19:26 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 15, 2022 19:48 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 15, 2022 19:49 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 15, 2022 19:57 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 15, 2022 19:57 Inactive
@benmoriceau
Copy link
Contributor Author

as

Yes, but it should be in another PR.

@benmoriceau
Copy link
Contributor Author

benmoriceau commented Feb 15, 2022

retryFailedActivity ?

Done

@benmoriceau
Copy link
Contributor Author

why do we need a no args constructor?

It is needed for the internal temporal serialization

@benmoriceau
Copy link
Contributor Author

benmoriceau commented Feb 15, 2022

can we replace the word "stuck" with "quarantined", please? quarantined is a common state name for a workflow that needs manual intervention. within that we can add information on the reason it is stuck.

Done

@benmoriceau benmoriceau temporarily deployed to more-secrets February 15, 2022 21:57 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets February 15, 2022 21:57 Inactive
@benmoriceau benmoriceau merged commit ab10996 into master Feb 15, 2022
@benmoriceau benmoriceau deleted the bmoric/signal-stuck branch February 15, 2022 22:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform issues related to the platform area/worker Related to worker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants