Workflow hangs indefinitely when initContainer in containerSet fails #14495

Closed · 3 of 4 tasks
hanneskaeufler opened this issue May 23, 2025 · 2 comments · Fixed by #14510

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

When a DAG step uses a containerSet template with an initContainer and the initContainer fails, the containers in the containerSet are never scheduled or run, and the Workflow hangs indefinitely.

Note that the workflow fails as expected when using a container + initContainer, so the inconsistency is specific to containerSet templates.
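
For reference, a plain container + initContainers variant along these lines (a rough illustrative sketch; the template name is arbitrary) fails the workflow as expected when the init container exits non-zero:

    - name: foo-plain-container
      container:
        image: python:latest
        command:
          - echo
        args:
          - hello world
      initContainers:
        - name: setup
          image: python:latest
          command:
            - /bin/sh
          args:
            - -c
            - sleep 1 && exit 1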

Version(s)

v3.6.4

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: hello-world-containerset-initcontainerfails
  generateName: hello-world-
  namespace: my-pipeline
  labels:
    workflows.argoproj.io/archive-strategy: "false"
    workflows.argoproj.io/creator: system-serviceaccount-argo-argo-server
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
spec:
  templates:
    - name: hello-world
      inputs: {}
      outputs: {}
      metadata: {}
      dag:
        tasks:
          - name: foo
            template: foo-template
            arguments: {}
    - name: foo-template
      inputs: {}
      outputs: {}
      metadata: {}
      containerSet:
        containers:
          - name: main
            image: python:latest
            command:
              - echo
            args:
              - hello world
            resources: {}
          - name: something
            image: python:latest
            command:
              - echo
            args:
              - hello world
            resources: {}
          - name: running
            image: python:latest
            command:
              - echo
            args:
              - hello world
            resources: {}
      initContainers:
        - name: setup
          image: python:latest
          command:
            - /bin/sh
          args:
            - -c
            - sleep 1 && exit 1
          resources: {}
  entrypoint: hello-world
  arguments: {}
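
(The manifest can be submitted with the Argo CLI, e.g. argo submit -n my-pipeline --watch <this file>; the exact submission path should not matter for reproducing the hang.)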

Logs from the workflow controller

time="2025-05-23T07:04:33.280Z" level=info msg="Processing workflow" Phase= ResourceVersion=136413 namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.287Z" level=info msg="Task-result reconciliation" namespace=my-pipeline numObjs=0 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.287Z" level=info msg="Updated phase  -> Running" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.287Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.287Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.287Z" level=info msg="DAG node hello-world-containerset-initcontainerfails-ck7b8 initialized Running" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.287Z" level=warning msg="was unable to obtain the node for hello-world-containerset-initcontainerfails-ck7b8-863968479, taskName foo"
time="2025-05-23T07:04:33.287Z" level=warning msg="was unable to obtain the node for hello-world-containerset-initcontainerfails-ck7b8-863968479, taskName foo"
time="2025-05-23T07:04:33.287Z" level=info msg="All of node hello-world-containerset-initcontainerfails-ck7b8.foo dependencies [] completed" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.287Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.288Z" level=info msg="Pod node hello-world-containerset-initcontainerfails-ck7b8-863968479 initialized Pending" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.292Z" level=info msg="Created pod: hello-world-containerset-initcontainerfails-ck7b8.foo (hello-world-containerset-initcontainerfails-ck7b8-foo-template-863968479)" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.292Z" level=info msg="Container node hello-world-containerset-initcontainerfails-ck7b8-535633234 initialized Pending" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.292Z" level=info msg="Container node hello-world-containerset-initcontainerfails-ck7b8-2530348221 initialized Pending" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.292Z" level=info msg="Container node hello-world-containerset-initcontainerfails-ck7b8-2809680178 initialized Pending" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.292Z" level=info msg="TaskSet Reconciliation" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.292Z" level=info msg=reconcileAgentPod namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.298Z" level=info msg="Workflow update successful" namespace=my-pipeline phase=Running resourceVersion=136417 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.299Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=136417 namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.299Z" level=info msg="Task-result reconciliation" namespace=my-pipeline numObjs=0 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.299Z" level=info msg="node changed" namespace=my-pipeline new.message= new.phase=Pending new.progress=0/1 nodeID=hello-world-containerset-initcontainerfails-ck7b8-863968479 old.message= old.phase=Pending old.progress=0/1 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.300Z" level=info msg="TaskSet Reconciliation" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.300Z" level=info msg=reconcileAgentPod namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:33.305Z" level=info msg="Workflow update successful" namespace=my-pipeline phase=Running resourceVersion=136420 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.295Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=136420 namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.296Z" level=info msg="Task-result reconciliation" namespace=my-pipeline numObjs=0 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.296Z" level=info msg="Pod failed: Error (exit code 1)" displayName=foo namespace=my-pipeline pod=hello-world-containerset-initcontainerfails-ck7b8-foo-template-863968479 templateName=foo-template workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.296Z" level=info msg="marking node as failed since init container has non-zero exit code" namespace=my-pipeline new.phase=Failed workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.296Z" level=info msg="node changed" namespace=my-pipeline new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=hello-world-containerset-initcontainerfails-ck7b8-863968479 old.message= old.phase=Pending old.progress=0/1 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.297Z" level=info msg="TaskSet Reconciliation" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.297Z" level=info msg=reconcileAgentPod namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.307Z" level=info msg="Workflow update successful" namespace=my-pipeline phase=Running resourceVersion=136454 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.308Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=136454 namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.308Z" level=info msg="Task-result reconciliation" namespace=my-pipeline numObjs=0 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.308Z" level=info msg="Pod failed: Error (exit code 1)" displayName=foo namespace=my-pipeline pod=hello-world-containerset-initcontainerfails-ck7b8-foo-template-863968479 templateName=foo-template workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.308Z" level=info msg="marking node as failed since init container has non-zero exit code" namespace=my-pipeline new.phase=Failed workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.308Z" level=info msg="node unchanged" namespace=my-pipeline nodeID=hello-world-containerset-initcontainerfails-ck7b8-863968479 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.309Z" level=info msg="TaskSet Reconciliation" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.309Z" level=info msg=reconcileAgentPod namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:43.314Z" level=info msg="cleaning up pod" action=labelPodCompleted key=my-pipeline/hello-world-containerset-initcontainerfails-ck7b8-foo-template-863968479/labelPodCompleted
time="2025-05-23T07:04:53.309Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=136454 namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:53.309Z" level=info msg="Task-result reconciliation" namespace=my-pipeline numObjs=0 workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:53.309Z" level=info msg="TaskSet Reconciliation" namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8
time="2025-05-23T07:04:53.309Z" level=info msg=reconcileAgentPod namespace=my-pipeline workflow=hello-world-containerset-initcontainerfails-ck7b8

Logs from your workflow's wait container

Error from server (BadRequest): container "wait" in pod "hello-world-containerset-initcontainerfails-ck7b8-foo-template-863968479" is waiting to start: PodInitializing
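
(The wait container never starts because the setup init container fails, so the pod never gets past PodInitializing; the message above is what kubectl logs <pod> -c wait returns for a container that has not started.)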

@chengjoey (Member)

Have you tried the latest version? I can't reproduce it in the latest version.

@jswxstw (Member) commented May 27, 2025

@chengjoey I've tested it with the latest (main) and reproduced it. However, the workflow will not stay stuck in Running indefinitely; it will fail on the next resync.
#13858 also mentioned this issue, but the focus there was not on the workflow getting stuck.
