Skip to content

fix: correct retry logic #13734

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Nov 21, 2024
Merged

fix: correct retry logic #13734

merged 13 commits into from
Nov 21, 2024

Conversation

isubasinghe
Copy link
Member

@isubasinghe isubasinghe commented Oct 10, 2024

Fixes #12543 and other retry related issues.

Motivation

The current retry logic is too simplistic, it relies only on resetting nodes by traversing boundary nodes.
This approach does not work, it needs to be a combination of manual traversal and boundary node traversal.

In addition to this the current logic does not take care of various edge cases, for example container sets.

It was clear some rethinking needed to be done such that:
a) edge cases were taken care of.
b) retry logic was simpler to understand and hopefully fix if any further bug were to arise.

Modifications

This is a complete rewrite of the FormulateRetryWorkflow function.

Verification

Extensive testing performed manually to verify that behaviour is as expected.

@agilgur5 agilgur5 added the area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries label Oct 10, 2024
@Joibel
Copy link
Member

Joibel commented Oct 17, 2024

This might supersede #12105

Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
@isubasinghe isubasinghe marked this pull request as ready for review October 21, 2024 07:12
Comment on lines 3798 to 3807
func TestNestedDAG(t *testing.T) {
require := require.New(t)
wf := wfv1.MustUnmarshalWorkflow(nestedDAG)

newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{})
require.NoError(err)
_ = newWf
_ = podsToDelete

}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should actually test for individual node status as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

`

func TestStepsRetryWorkflow(t *testing.T) {
assert := assert.New(t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you do this for assert and require here and elsewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I was waiting for an opinion on #13734 (comment). I plan to use these when I fill in the tests.

I was unsure if it was worth the effort of these tests. I wrote these to test edge conditions and make sure they worked as expected. They will prevent regressions though, so I will add them.

@@ -1026,51 +1025,6 @@ func TestRetryExitHandler(t *testing.T) {
func TestFormulateRetryWorkflow(t *testing.T) {
ctx := context.Background()
wfClient := argofake.NewSimpleClientset().ArgoprojV1alpha1().Workflows("my-ns")
createdTime := metav1.Time{Time: time.Now().Add(-1 * time.Second).UTC()}
finishedTime := metav1.Time{Time: createdTime.Add(time.Second * 2)}
t.Run("Steps", func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping this entire test feels wrong, why can't it be checked still?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really testable easily, we cannot establish a dag for these status fields. The root node is not present.
This will error out complaining that the root node couldn't be found.

Comment on lines 3798 to 3807
func TestNestedDAG(t *testing.T) {
require := require.New(t)
wf := wfv1.MustUnmarshalWorkflow(nestedDAG)

newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{})
require.NoError(err)
_ = newWf
_ = podsToDelete

}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

}

func isGroupNode(node wfv1.NodeStatus) bool {
return node.Type == wfv1.NodeTypeDAG || node.Type == wfv1.NodeTypeTaskGroup || node.Type == wfv1.NodeTypeStepGroup || node.Type == wfv1.NodeTypeSteps
func consumeTill(n *node, should tillFn, resetFunc resetFn) (*node, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Rename to resetUntil

reset instead of consume, as you call the resetFunc. Or call that consumeFunc

till has more meanings, until is much clearer if English isn't your first language.

default:
// Do not allow retry of workflows with pods in Running/Pending phase
return nil, nil, errors.InternalErrorf("Workflow cannot be retried with node %s in %s phase", node.Name, node.Phase)
deleteNodes, err := getNodeIDsToResetNoChildren(restartSuccessful, nodeFieldSelector, wf.Status.Nodes)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would I have to set the --restart-successful in order to identify the nodes that need to be deleted, even if I only wish to retry a specific failed node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this wouldn't be the case

Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

addToDelete(curr.n.ID, true)
children := getChildren(curr)
for childID := range children {
addToDelete(childID, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I follow and agree now. Thanks.

@isubasinghe isubasinghe merged commit 497f338 into argoproj:main Nov 21, 2024
28 checks passed
@Joibel Joibel deleted the fix-retry-logic branch November 29, 2024 10:17
Joibel pushed a commit to Joibel/argo-workflows that referenced this pull request Dec 5, 2024
Joibel pushed a commit to Joibel/argo-workflows that referenced this pull request Dec 6, 2024
Joibel pushed a commit to Joibel/argo-workflows that referenced this pull request Dec 6, 2024
@shuangkun
Copy link
Member

Could this PR be cherry-picked into the 3.6.5 branch? Some customers have inquired about it, but I haven't found any prior discussions related to the 3.6.x patches. Thanks @Joibel

@Joibel
Copy link
Member

Joibel commented Mar 6, 2025

I've created a patch releases discussion #14259.

I haven't backported this yet, it feels like #14124 should be addressed before this is backported otherwise we're knowingly introducing a regression.

@Joibel
Copy link
Member

Joibel commented Apr 23, 2025

/cherry-pick release-3.6

Joibel pushed a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <[email protected]>
(cherry picked from commit 497f338)
Joibel pushed a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <[email protected]>
(cherry picked from commit 497f338)
Signed-off-by: Alan Clucas <[email protected]>
Joibel pushed a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <[email protected]>
(cherry picked from commit 497f338)
Signed-off-by: Alan Clucas <[email protected]>
Joibel pushed a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <[email protected]>
(cherry picked from commit 497f338)
Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Isitha Subasinghe <[email protected]>
Joibel pushed a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <[email protected]>
(cherry picked from commit 497f338)
Signed-off-by: Alan Clucas <[email protected]>
kim-codefresh added a commit to codefresh-io/argo-workflows that referenced this pull request May 20, 2025
…abilities fixes (Cr 28355) (#358)

* fix: bump deps for k8schain to fix ecr-login (argoproj#14008) (release-3.6 cherry-pick) (argoproj#14174)

* fix(ci): python sdk release process (release-3.6) (argoproj#14183)

Signed-off-by: Alan Clucas <[email protected]>

* docs: clarify qps/burst on controller (cherry-pick argoproj#14190) (argoproj#14192)

Signed-off-by: Tim Collins <[email protected]>
Co-authored-by: Tim Collins <[email protected]>

* fix(api/jsonschema): use unchanging JSON Schema version (cherry-pick argoproj#14092) (argoproj#14256)

Signed-off-by: Roger Peppe <[email protected]>
Co-authored-by: Roger Peppe <[email protected]>

* fix(api/jsonschema): use working `$id` (cherry-pick argoproj#14257) (argoproj#14258)

Signed-off-by: Roger Peppe <[email protected]>
Co-authored-by: Roger Peppe <[email protected]>

* docs: autogenerate tested k8s versions and centralize config (argoproj#14176) (release-3.6) (argoproj#14262)

Signed-off-by: Mason Malone <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Mason Malone <[email protected]>

* chore(deps): bump minio-go to newer version (argoproj#14185) (release-3.6) (argoproj#14261)

Co-authored-by: Vaibhav Kaushik <[email protected]>

* fix: split pod controller from workflow controller (argoproj#14129) (release-3.6) (argoproj#14263)

* chore(deps): fix snyk (argoproj#14264) (release-3.6) (argoproj#14268)

* chore: revert to correct k8s versions

Accidental bump from argoproj#14176 cherry-pick

Signed-off-by: Alan Clucas <[email protected]>

* chore(deps): bump github.com/go-jose/go-jose/v3 from 3.0.3 to 3.0.4 in the go_modules group (cherry-pick argoproj#14231) (argoproj#14269)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: wait for workflow informer to sync before pod informer (cherry-pick argoproj#14248) (argoproj#14266)

Signed-off-by: Rohan K <[email protected]>
Co-authored-by: Rohan K <[email protected]>

* fix(cli): remove red from log colour selection. Fixes argoproj#6740 (cherry-pick argoproj#14215) (argoproj#14278)

Signed-off-by: Prabakaran Kumaresshan <[email protected]>
Co-authored-by: Prabakaran Kumaresshan <[email protected]>

* fix: correct semaphore configmap keys for multiple semaphores (argoproj#14184) (release-3.6) (argoproj#14281)

* fix: don't print help for non-validation errors. Fixes argoproj#14234 (cherry-pick argoproj#14249) (argoproj#14283)

Signed-off-by: Koichi Shimada <[email protected]>
Signed-off-by: Mason Malone <[email protected]>
Co-authored-by: koichi <[email protected]>
Co-authored-by: Mason Malone <[email protected]>

* docs: fix kubernetes versions (release-3.6) (argoproj#14273)

Signed-off-by: Alan Clucas <[email protected]>

* fix(workflow/sync): use RWMutex to prevent concurrent map access (cherry-pick argoproj#14321) (argoproj#14322)

Signed-off-by: Ryan Currah <[email protected]>
Co-authored-by: Ryan Currah <[email protected]>

* chore(lint): update golangci-lint to 2.1.1 (argoproj#14390) (cherry-pick release-3.6) (argoproj#14417)

* chore: bump golang 1.23->1.24 (argoproj#14385) (cherry-pick release-3.6) (argoproj#14418)

* fix: gracefully handle invalid CronWorkflows and simplify logic.  (cherry-pick argoproj#14197) (argoproj#14419)

Signed-off-by: Mason Malone <[email protected]>

* fix: prevent dfs sorter infinite recursion on cycle. Fixes argoproj#13395 (cherry-pick argoproj#14391) (argoproj#14420)

Signed-off-by: Adrien Delannoy <[email protected]>
Co-authored-by: Adrien Delannoy <[email protected]>

* chore(deps): bump github.com/expr-lang/expr from 1.16.9 to 1.17.0 (argoproj#14307) (release-3.6) (argoproj#14421)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps)!: update k8s and argo-events (release-3.6) (argoproj#14424)

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: william.vanhevelingen <[email protected]>
Signed-off-by: Mason Malone <[email protected]>
Signed-off-by: William Van Hevelingen <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: William Van Hevelingen <[email protected]>
Co-authored-by: Mason Malone <[email protected]>

* fix: correct retry logic (argoproj#13734) (release-3.6) (argoproj#14428)

Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Isitha Subasinghe <[email protected]>

* fix: manual retries exit handler cleanup. Fixes argoproj#14180 (argoproj#14181) (release-3.6) (argoproj#14429)

Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Isitha Subasinghe <[email protected]>

* fix: correct manual retry logic. Fixes argoproj#14124 (argoproj#14328) (release-3.6) (argoproj#14430)

Signed-off-by: oninowang <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: jswxstw <[email protected]>

* fix: disable ALPN in argo-server as a workaround (argoproj#14433)

Signed-off-by: Alan Clucas <[email protected]>

* result of codegen

Signed-off-by: Kim <[email protected]>

* fix:lint

Signed-off-by: Kim <[email protected]>

---------

Signed-off-by: Alan Clucas <[email protected]>
Signed-off-by: Tim Collins <[email protected]>
Signed-off-by: Roger Peppe <[email protected]>
Signed-off-by: Mason Malone <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Rohan K <[email protected]>
Signed-off-by: Prabakaran Kumaresshan <[email protected]>
Signed-off-by: Koichi Shimada <[email protected]>
Signed-off-by: Ryan Currah <[email protected]>
Signed-off-by: Adrien Delannoy <[email protected]>
Signed-off-by: william.vanhevelingen <[email protected]>
Signed-off-by: William Van Hevelingen <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: oninowang <[email protected]>
Signed-off-by: Kim <[email protected]>
Co-authored-by: Alan Clucas <[email protected]>
Co-authored-by: gcp-cherry-pick-bot[bot] <98988430+gcp-cherry-pick-bot[bot]@users.noreply.github.com>
Co-authored-by: Tim Collins <[email protected]>
Co-authored-by: Roger Peppe <[email protected]>
Co-authored-by: Mason Malone <[email protected]>
Co-authored-by: Vaibhav Kaushik <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Rohan K <[email protected]>
Co-authored-by: Prabakaran Kumaresshan <[email protected]>
Co-authored-by: koichi <[email protected]>
Co-authored-by: Ryan Currah <[email protected]>
Co-authored-by: Adrien Delannoy <[email protected]>
Co-authored-by: William Van Hevelingen <[email protected]>
Co-authored-by: Isitha Subasinghe <[email protected]>
Co-authored-by: jswxstw <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retrying specific failed node does not work
5 participants