
Commit e6083c4

Introduce pod deletion timeout
* if a k8s node becomes unresponsive, the kube controller soft deletes all of its pods after the eviction time (default 5 mins)
* as long as the node stays unresponsive, the pod never leaves its last status, so the runner controller assumes everything is fine with the pod and does not try to create a new one
* this can result in a situation where a horizontal autoscaler thinks that none of its runners are currently busy and will not schedule any further runners / pods, leaving a dead runner deployment until the runnerreplicaset is deleted or the node comes back online
* introduce a pod deletion timeout (1 minute) after which the runner controller tries to reboot the runner and create the pod on a working node
1 parent bbb036e commit e6083c4


1 file changed: +18 −1


controllers/runner_controller.go

+18 −1
@@ -185,7 +185,24 @@ func (r *RunnerReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
 	}
 
 	if !pod.ObjectMeta.DeletionTimestamp.IsZero() {
-		return ctrl.Result{}, err
+		deletionTimeout := 1 * time.Minute
+		currentTime := time.Now()
+		deletionDidTimeout := currentTime.Sub(pod.DeletionTimestamp.Add(deletionTimeout)) > 0
+
+		if deletionDidTimeout {
+			log.Info(
+				"Runner failed to delete itself in a timely manner. "+
+					"Recreating the pod to see if it resolves the issue. "+
+					"This is typically the case when a Kubernetes node became unreachable "+
+					"and the kube controller started evicting pods.",
+				"podDeletionTimestamp", pod.DeletionTimestamp,
+				"currentTime", currentTime,
+				"configuredDeletionTimeout", deletionTimeout,
+			)
+			restart = true
+		} else {
+			return ctrl.Result{}, err
+		}
 	}
 
 	if pod.Status.Phase == corev1.PodRunning {