ARC Runner Job step getting stuck until timeout or prematurely starting next Job step while the previous step process is still in progress #4019
Unanswered · ahamednijamudeen asked this question in Questions
We are running the GitHub ARC controller and runner Helm charts, version 0.10.1, on Azure Kubernetes v1.31.2, with ARC in `kubernetes` container mode. In our workflow we have a Job step running a long Postgres `pg_dump`, which can take a couple of hours to complete due to the size of the database. What I have been seeing when the Job is triggered is that the step stops printing any logs after a few minutes even though the `pg_dump` process still appears to be running in the workflow pod. Sometimes the step simply hangs without any progress and times out after 6 hours, even though the process inside the workflow pod seems to have finished well before that. In other cases, the Job proceeds to the next steps, `clean up` and `pg_restore`, while the previous step's `pg_dump` process is still in progress within the workflow pod.
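The job is laid out roughly like the sketch below; the workflow name, runner label, container image, secrets, and exact commands are simplified placeholders rather than the real workflow (the real dump step invokes `pg_dump` from a shell script):

```yaml
name: database-refresh                        # placeholder workflow name
on:
  workflow_dispatch: {}

jobs:
  refresh:
    runs-on: arc-runner-set                   # placeholder: ARC runner scale set name
    container:
      image: postgres:16                      # placeholder image providing pg_dump/pg_restore
    timeout-minutes: 360                      # the 6-hour limit the stuck step eventually hits
    env:
      SOURCE_DSN: ${{ secrets.SOURCE_DSN }}   # placeholder source connection string
      TARGET_DSN: ${{ secrets.TARGET_DSN }}   # placeholder target connection string
    steps:
      - name: pg_dump                         # long-running step (a couple of hours)
        shell: sh
        run: |
          # placeholder for the real dump script invoked via sh
          pg_dump --format=custom --dbname="$SOURCE_DSN" --file=/tmp/db.dump

      - name: clean up                        # observed to start while pg_dump is still running
        shell: sh
        run: |
          # placeholder clean-up of the target database
          echo "cleaning up target database"

      - name: pg_restore
        shell: sh
        run: |
          # placeholder restore into the target database
          pg_restore --dbname="$TARGET_DSN" /tmp/db.dump
```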
Below is the process output from the workflow pod while `pg_dump` was running, which shows the process being executed via the shell script step (`sh`).
After a couple of minutes, the Job proceeds to the next step, `clean up`, and eventually to `pg_restore`, while the previous `pg_dump` process hasn't completed. Below is the process listing, which shows both processes active even though the Job somehow determined that the `pg_dump` step was complete.
Here's the `top` output from the workflow pod, which shows processes from two different steps running at the same time. The CPU utilization of `pg_dump` stays close to ~100%: the `pg_dump` process has been running for close to 19 minutes, while `pg_restore`, from a different step, has been running in parallel for ~2 minutes.

Below is the runner pod log during the step transitions between `pg_dump`, `clean_up` and `pg_restore`.
Below is our runner pod spec. We are using Flux CD to sync it in our environment.
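For context, here is a trimmed sketch of a `kubernetes`-mode runner scale set HelmRelease reconciled by Flux; the names, namespace, secret, and storage class are placeholders rather than our actual values:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: arc-runner-set                        # placeholder release name
  namespace: arc-runners                      # placeholder namespace
spec:
  interval: 10m
  chart:
    spec:
      chart: gha-runner-scale-set
      version: 0.10.1                         # chart version mentioned above
      sourceRef:
        kind: HelmRepository
        name: actions-runner-controller       # placeholder OCI HelmRepository name
        namespace: flux-system
  values:
    githubConfigUrl: https://github.com/my-org       # placeholder org URL
    githubConfigSecret: arc-github-app-secret        # placeholder secret with GitHub App credentials
    minRunners: 1
    maxRunners: 5
    containerMode:
      type: kubernetes                        # job steps run in a separate workflow pod
      kubernetesModeWorkVolumeClaim:
        accessModes: ["ReadWriteOnce"]
        storageClassName: managed-csi         # placeholder AKS storage class
        resources:
          requests:
            storage: 10Gi
```

In `kubernetes` container mode the runner pod and the workflow pod are separate, which is why the process listings above come from the workflow pod while the runner pod only logs the step transitions.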