Argoexec kill command does not work in case of Windows container failures #14297

criscola · 2025-03-13T15:24:01Z

Pre-requisites

I have double-checked my configuration
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
I have searched existing issues and could not find a match for this bug
I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

This is a follow-up of #13693.

When a Windows container fails, e.g. due to a crash, it goes into the Error state. The wait container then runs the argoexec kill command to signal to the Argo workflow controller that the step should be retried by creating another Pod. In the Windows case, the command failed with exit code 64 (error in argoexec) because os.Kill was used. Replacing it with osspecific.Kill did not solve the issue on its own: unlike Linux containers, Windows containers don't have a PID 1, so osspecific.Kill had to be adapted to also find the PID of the argoexec process. In our tests the fix works as expected: Windows containers in the Error state no longer get stuck, Argo creates another Pod, and the erroring Pod is cleaned up when the workflow completes. Please see our fix here: main...helio:argo-workflows:fix-argoexec-kill-windows.

We have a question before submitting the PR. Inside osspecific.Kill we can't really tell whether the intent is to kill PID 1 or not, so we may need to extract the "find PID" logic out of osspecific.Kill and pass its result to the kill function. That would require either an OS check at the call site or, even better, a new function in the signal_<os>.go files that retrieves the PID (on Linux it would just be a "return 1"). Any thoughts on this and the PR? Thanks!
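To make the second option easier to discuss, here is a rough sketch of the idea. It is not the code in our branch: `targetPID` and `pidLookup` are hypothetical names, the Windows lookup is a stand-in, and the OS is passed in as a string only so that the example fits in a single file.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// pidLookup is a hypothetical hook that locates the target process on
// platforms where it is not PID 1 (for example by scanning the process list).
type pidLookup func() (int, error)

// targetPID returns the PID that the kill command should signal.
// Assumption: everywhere except Windows the target is PID 1, while on
// Windows the PID has to be discovered at runtime.
func targetPID(goos string, lookup pidLookup) (int, error) {
	if goos != "windows" {
		return 1, nil
	}
	if lookup == nil {
		return 0, errors.New("no PID lookup configured for windows")
	}
	return lookup()
}

func main() {
	// Stand-in lookup; a real one would inspect the running processes.
	fakeLookup := func() (int, error) { return 4242, nil }

	pid, err := targetPID(runtime.GOOS, fakeLookup)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	// The resolved PID would then be handed to the kill function.
	fmt.Println("would signal PID", pid)
}
```

In the actual source tree the check on `goos` would presumably disappear: each signal_<os>.go file would provide its own implementation, selected at build time, and the caller would pass the result on to the kill function.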

/cc @mweibel

Version(s)

v3.6.5

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

n/a

Logs from your workflow's wait container

n/a


criscola added the type/bug label Mar 13, 2025

criscola linked a pull request Apr 4, 2025 that will close this issue

fix: argoexec kill command in Windows to use osspecific.Kill #14352

Draft
