You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
I have searched existing issues and could not find a match for this bug
When a Windows container fails e.g. due to a crash it goes into Error state. The wait container will trigger argoexec kill command to signal to argo workflow controller that the step should be retried by creating another Pod. In the Windows container case, the error failed with code 64 (error in argoexec), since os.Kill was used. Upon replacing the latter with osspecific.Kill the issue was still not solved because as opposed to Linux, Windows containers don't have PID 1 so osspecific.Kill had to be adapted to also find the PID of the argoexec process. In our tests the fix works as expected, now Windows containers in Error state don't get stuck anymore, Argo will create another Pod and when the workflow completes the erroring Pod will be cleaned up. Please see our fix here: main...helio:argo-workflows:fix-argoexec-kill-windows.
We have a question before submitting the PR, in osspecific.Kill we don't really know if we want to kill PID 1 or not, so maybe we'd need to extract the "find PID logic" outside of osspecific.Kill and supply its result to the kill function. However we would need a os check condition, or even better add a method to the signal_<os>.go files to retrieve the PID (in Linux it would be just a "return 1"). Any thoughts on this and the PR? Thanks!
Pre-requisites
:latest
image tag (i.e.quay.io/argoproj/workflow-controller:latest
) and can confirm the issue still exists on:latest
. If not, I have explained why, in detail, in my description below.What happened? What did you expect to happen?
This is a follow-up of #13693.
When a Windows container fails e.g. due to a crash it goes into Error state. The
wait
container will triggerargoexec kill
command to signal to argo workflow controller that the step should be retried by creating another Pod. In the Windows container case, the error failed with code 64 (error in argoexec), sinceos.Kill
was used. Upon replacing the latter withosspecific.Kill
the issue was still not solved because as opposed to Linux, Windows containers don't have PID 1 soosspecific.Kill
had to be adapted to also find the PID of theargoexec
process. In our tests the fix works as expected, now Windows containers in Error state don't get stuck anymore, Argo will create another Pod and when the workflow completes the erroring Pod will be cleaned up. Please see our fix here: main...helio:argo-workflows:fix-argoexec-kill-windows.We have a question before submitting the PR, in
osspecific.Kill
we don't really know if we want to kill PID 1 or not, so maybe we'd need to extract the "find PID logic" outside ofosspecific.Kill
and supply its result to the kill function. However we would need a os check condition, or even better add a method to thesignal_<os>.go
files to retrieve the PID (in Linux it would be just a "return 1"). Any thoughts on this and the PR? Thanks!/cc @mweibel
Version(s)
v3.6.5
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
n/a
Logs from the workflow controller
Logs from in your workflow's wait container
The text was updated successfully, but these errors were encountered: