Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Argoexec kill command does not work in case of Windows container failures #14297

Open
4 tasks done
criscola opened this issue Mar 13, 2025 · 0 comments · May be fixed by #14352
Open
4 tasks done

Argoexec kill command does not work in case of Windows container failures #14297

criscola opened this issue Mar 13, 2025 · 0 comments · May be fixed by #14352
Labels

Comments

@criscola
Copy link

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

This is a follow-up of #13693.

When a Windows container fails e.g. due to a crash it goes into Error state. The wait container will trigger argoexec kill command to signal to argo workflow controller that the step should be retried by creating another Pod. In the Windows container case, the error failed with code 64 (error in argoexec), since os.Kill was used. Upon replacing the latter with osspecific.Kill the issue was still not solved because as opposed to Linux, Windows containers don't have PID 1 so osspecific.Kill had to be adapted to also find the PID of the argoexec process. In our tests the fix works as expected, now Windows containers in Error state don't get stuck anymore, Argo will create another Pod and when the workflow completes the erroring Pod will be cleaned up. Please see our fix here: main...helio:argo-workflows:fix-argoexec-kill-windows.

We have a question before submitting the PR, in osspecific.Kill we don't really know if we want to kill PID 1 or not, so maybe we'd need to extract the "find PID logic" outside of osspecific.Kill and supply its result to the kill function. However we would need a os check condition, or even better add a method to the signal_<os>.go files to retrieve the PID (in Linux it would be just a "return 1"). Any thoughts on this and the PR? Thanks!

/cc @mweibel

Version(s)

v3.6.5

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

n/a

Logs from in your workflow's wait container

n/a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant