Recover from quarantine and version state issue #10932

Closed
Tracked by #10933
benmoriceau opened this issue Mar 8, 2022 · 1 comment · Fixed by #13071
Labels
area/platform (issues related to the platform), team/platform-move

Comments

@benmoriceau
Contributor

benmoriceau commented Mar 8, 2022

What

If a workflow is in a quarantined state, we need to allow it to be restarted from a clean state. This clean state is reached by cancelling the running job and its related attempts. The only operation allowed on a quarantined workflow is cancelling the sync.

When a quarantined workflow receives a cancel signal, we will need to "force" the restart of all the jobs related to the given job. The workflow needs to:

  • Get all the jobs that are currently in a non-terminal state
  • Fail all of those jobs and their related attempts
  • Continue as new
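To make the cleanup steps above concrete, here is a minimal Python sketch. It is not the platform's actual code: the `Status`, `Attempt`, and `Job` types and the `fail_non_terminal_jobs` helper are hypothetical names, and the set of terminal statuses is an assumption. The final "continue as new" step belongs to the workflow engine and is not modelled here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Assumption: a job or attempt in one of these statuses can no longer progress.
TERMINAL = {Status.SUCCEEDED, Status.FAILED, Status.CANCELLED}


@dataclass
class Attempt:
    status: Status


@dataclass
class Job:
    job_id: int
    status: Status
    attempts: List[Attempt] = field(default_factory=list)


def fail_non_terminal_jobs(jobs: List[Job]) -> List[int]:
    """Fail every non-terminal job and its non-terminal attempts.

    Returns the ids of the jobs that were failed, in input order.
    """
    failed_ids = []
    for job in jobs:
        if job.status in TERMINAL:
            continue  # already finished, leave it untouched
        for attempt in job.attempts:
            if attempt.status not in TERMINAL:
                attempt.status = Status.FAILED
        job.status = Status.FAILED
        failed_ids.append(job.job_id)
    return failed_ids
```

Jobs that already reached a terminal status are skipped, so running the cleanup twice does not change anything the second time.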

To do that, if any manual operation is performed on a quarantined workflow, we will:

  • Terminate the current workflow
  • Restart it, which will trigger the cleaning of the state
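The terminate-then-restart handling described above could look roughly like the sketch below. `InMemoryWorkflows` is a hypothetical in-memory stand-in for the workflow engine client, not a real API, and the state strings and the `clean_state` flag are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class InMemoryWorkflows:
    """Hypothetical stand-in for the workflow engine client (not a real API)."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.log: List[Tuple] = []

    def terminate(self, connection_id: str) -> None:
        self.state[connection_id] = "terminated"
        self.log.append(("terminate", connection_id))

    def start(self, connection_id: str, clean_state: bool) -> None:
        self.state[connection_id] = "running"
        self.log.append(("start", connection_id, clean_state))


def handle_manual_operation(workflows: InMemoryWorkflows, connection_id: str) -> bool:
    """Reset a quarantined workflow before a manual operation.

    Returns True if the workflow was terminated and restarted, False if it
    was not quarantined and nothing was done.
    """
    if workflows.state.get(connection_id) != "quarantined":
        return False
    workflows.terminate(connection_id)
    # Assumption: starting with clean_state=True is what triggers the cleanup.
    workflows.start(connection_id, clean_state=True)
    return True
```

Workflows that are not quarantined are left alone, so the reset path only runs when it is needed.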

Restart failure

If the DB update fails, we need to go back to a quarantined state.
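A small sketch of that fallback, assuming the workflow state is a plain dict and the DB cleanup is passed in as a callable. Both `restart_from_clean_state` and `clean_db_state` are hypothetical names used only for this example.

```python
from typing import Callable, Dict


def restart_from_clean_state(workflow: Dict[str, str],
                             clean_db_state: Callable[[], None]) -> bool:
    """Try to clean the DB state; fall back to quarantine if that fails.

    Returns True when the cleanup succeeded and the workflow is running
    again, False when the workflow was put back into quarantine.
    """
    try:
        clean_db_state()
    except Exception:
        # Assumption: any failure of the DB update sends us back to quarantine.
        workflow["state"] = "quarantined"
        return False
    workflow["state"] = "running"
    return True
```

Because a failed cleanup leaves the workflow quarantined, a later cancel signal can retry the same recovery path.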

Version issue state

If the workflow is not reachable, it may be because of a version issue. To handle those cases, we will try to terminate all the workflows that are unreachable, as we do for the quarantined ones.
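One way to picture this is the sketch below. `UnreachableWorkflowError`, `query_state`, and `terminate` are hypothetical; how unreachability is actually detected is not specified in this issue, so an exception raised by a state query is only an assumption.

```python
from typing import Callable, Iterable, List


class UnreachableWorkflowError(Exception):
    """Hypothetical error raised when a workflow cannot be queried."""


def terminate_unreachable(workflow_ids: Iterable[str],
                          query_state: Callable[[str], object],
                          terminate: Callable[[str], None]) -> List[str]:
    """Terminate every workflow whose state query fails as unreachable.

    Returns the ids of the workflows that were terminated.
    """
    terminated = []
    for workflow_id in workflow_ids:
        try:
            query_state(workflow_id)
        except UnreachableWorkflowError:
            terminate(workflow_id)
            terminated.append(workflow_id)
    return terminated
```

Only the unreachable error is caught, so unrelated failures still surface instead of silently terminating a healthy workflow.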

DoD

If a workflow is in a quarantined state, it can receive a signal that forces it to be unstuck.

@benmoriceau benmoriceau added the type/enhancement (New feature or request), needs-triage, and area/platform (issues related to the platform) labels and removed the type/enhancement (New feature or request) and needs-triage labels Mar 8, 2022
@davinchia
Contributor

This happens when the connection manager workflow is quarantined?

@terencecho terencecho assigned terencecho and unassigned terencecho Mar 25, 2022
@terencecho terencecho removed their assignment Apr 21, 2022
@benmoriceau benmoriceau changed the title from "Recover from quarantine" to "Recover from quarantine and version state issue" May 13, 2022
6 participants