Add liveness and startup/readiness probes #139
Conversation
Hi @erikgb. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/cc @rjbez17
Just to confirm - since these checks don't hook into any HNC logic (currently), what are they actually testing that K8s wasn't already able to detect? E.g. I would have assumed that if the pod crashed on startup we'd already know. So are we actually flagging any new failure modes here?
Or is this more to be compliant with clusters that might have a rule like "all pods should have readiness and liveness checks?"
Thanks!
As I tried to say in our call last Friday, we could probably do more to get a correct health assessment in HNC, as this does not touch the non-standard controller code in HNC. So consider this a light start. I am not exactly sure what the probe actually checks by default in controller-runtime, but I would expect it not to mark a container as ready before the cache is synced. It seems to be working fine for all our controllers, which have both reconcilers and admission webhooks (of all kinds). I am observing a bunch of API calls failing because the HNC webhook is configured in the API server, but there is no webhook endpoint ready to handle the (totally unrelated) request. This is not optimal and affects general cluster availability! I would like the webhook endpoint to be highly available, as it is critical for my cluster.
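For reference, a minimal sketch of how controller-runtime health endpoints are typically wired into a manager; the bind address, check names, and bare-bones error handling are illustrative assumptions rather than HNC's actual main.go:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// Serve /healthz and /readyz on a dedicated probe port (address assumed).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":8081",
	})
	if err != nil {
		os.Exit(1)
	}

	// Liveness: a simple ping proving the process still serves HTTP.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		os.Exit(1)
	}
	// Readiness: also a ping by default; a deeper check such as the webhook
	// server's StartedChecker can be substituted, as discussed later in the thread.
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

Out of the box these checks do not touch any HNC-specific logic; they mostly confirm that the manager's probe endpoint is serving, which is why the deeper webhook-based check discussed below is interesting.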
Yes, I know that hooking into the HNC logic would be a good next step; I'm just curious as to what this PR does in isolation. I forgot that the cache sync was a possibility. Does it make sense for you to try this in a cluster and see if it improves the problems you're describing?
Are PR images available in any public registry? We run multi-tenant clusters, so it is not always that easy to test work-in-progress like this - since I am not personally a cluster-admin. We practice GitOps, and I think creating a new pipeline just to test this can be a bit tricky. But I can try to see what can be done. I would not expect this PR to fully fix our problem, since I believe HNC requires additional time to "warm up".
No, you'd have to build the image yourself and deploy it to a test cluster.
I wouldn't try this in prod :)
Yeah agreed, I'd just be curious as to whether the probes gave any additional information at all.
I am trying to base the readiness check on the webhook server's `StartedChecker()` from controller-runtime. But HNC seems unable to start the cert-rotator successfully when enabling this. I am looking into it, but please let me know if you have any ideas.
Maybe paste more information here so that others can have a look at what you're seeing?
With the following diff applied:

```diff
diff --git a/cmd/manager/main.go b/cmd/manager/main.go
index b4117677..90db78d6 100644
--- a/cmd/manager/main.go
+++ b/cmd/manager/main.go
@@ -236,7 +236,8 @@ func createManager() ctrl.Manager {
 		setupLog.Error(err, "unable to set up health check")
 		os.Exit(1)
 	}
-	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
+	readyChecker := mgr.GetWebhookServer().StartedChecker()
+	if err := mgr.AddReadyzCheck("readyz", readyChecker); err != nil {
 		setupLog.Error(err, "unable to set up ready check")
```

Nothing works, and so far I haven't got anything interesting out of the logs:

```
{"level":"info","ts":1644597871.7143803,"logger":"setup","msg":"Starting main.go:init()"}
{"level":"info","ts":1644597871.7166357,"logger":"setup","msg":"Finished main.go:init()"}
{"level":"info","ts":1644597871.7166703,"logger":"setup","msg":"Parsing flags"}
{"level":"info","ts":1644597871.7167115,"logger":"setup","msg":"Creating OpenCensus->Stackdriver exporter"}
{"level":"error","ts":1644597871.7183151,"logger":"setup","msg":"Could not create Stackdriver exporter","error":"stackdriver: google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.","stacktrace":"main.main\n\t/workspace/cmd/manager/main.go:91\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255"}
{"level":"info","ts":1644597871.7183697,"logger":"setup","msg":"Creating Prometheus exporter"}
{"level":"info","ts":1644597871.7184758,"logger":"setup","msg":"Configuring controller-manager"}
{"level":"info","ts":1644597871.7393453,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1644597871.739669,"logger":"setup","msg":"Starting certificate generation"}
{"level":"info","ts":1644597871.73979,"logger":"setup","msg":"Starting manager"}
{"level":"info","ts":1644597871.739821,"logger":"setup","msg":"Waiting for certificate generation to complete"}
{"level":"info","ts":1644597871.7400708,"logger":"controller-runtime.webhook.webhooks","msg":"Starting webhook server"}
{"level":"info","ts":1644597871.7401395,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":1644597871.740153,"msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":1644597871.7401528,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":1644597871.7402065,"msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":1644597871.7402062,"logger":"cert-rotation","msg":"starting cert rotator controller"}
{"level":"info","ts":1644597871.740309,"logger":"controller.cert-rotator","msg":"Starting EventSource","source":"&{{%!s(*v1.Secret=&{{ } { 0 {{0 0 <nil>}} <nil> <nil> map[] map[] [] [] []} <nil> map[] map[] }) %!s(*cache.informerCache=&{0xc0006fbdc0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
{"level":"info","ts":1644597871.7403488,"logger":"controller.cert-rotator","msg":"Starting EventSource","source":"&{{%!s(*unstructured.Unstructured=&{map[apiVersion:admissionregistration.k8s.io/v1 kind:ValidatingWebhookConfiguration]}) %!s(*cache.informerCache=&{0xc0006fbdc0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
{"level":"info","ts":1644597871.740357,"logger":"controller.cert-rotator","msg":"Starting EventSource","source":"&{{%!s(*unstructured.Unstructured=&{map[apiVersion:admissionregistration.k8s.io/v1 kind:MutatingWebhookConfiguration]}) %!s(*cache.informerCache=&{0xc0006fbdc0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
{"level":"info","ts":1644597871.7403603,"logger":"controller.cert-rotator","msg":"Starting Controller"}
{"level":"error","ts":1644597871.7403653,"logger":"controller.cert-rotator","msg":"Could not wait for Cache to sync","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced"}
{"level":"error","ts":1644597871.7403746,"msg":"error received after stop sequence was engaged","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced"}
{"level":"info","ts":1644597901.7410984,"msg":"Stopping and waiting for caches"}
{"level":"info","ts":1644597901.7412927,"msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":1644597901.7413201,"msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":1644597901.7410095,"logger":"setup","msg":"problem running manager","error":"[open /tmp/k8s-webhook-server/serving-certs/tls.crt: no such file or directory, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]","errorCauses":[{"error":"open /tmp/k8s-webhook-server/serving-certs/tls.crt: no such file or directory"},{"error":"failed waiting for all runnables to end within grace period of 30s: context deadline exceeded"}],"stacktrace":"runtime.main\n\t/usr/local/go/src/runtime/proc.go:255"} |
Force-pushed from b6c52fe to f9553f6.
@adrianludwin Even if the CI reports this PR as OK, I am unable to make it work in my local k3s cluster. I do not understand why, but it seems to be related to cert-rotator, which I am not familiar with; we use cert-manager for all cert-related work. Could you test this PR in one of your environments, please?

/hold
/lgtm cancel
@erikgb Can you please fix the commit messages? You can say that you couldn't get it to work, but that I could.
/lgtm cancel
Done!
This adds liveness and startup/readiness probes to the HNC Deployment using functionality recently added to controller-runtime. The default probe configuration (timeouts etc.) might need some adjustment to work without issues for a typical HNC deployment, and should be tested in a live cluster, which I do not have available.

Testing: Ran all tests successfully on my local workstation when using the simple ping probe. But when switching to the deeper webhook probe, the container never becomes ready. It seems to be an issue related to cert-rotator, which I do not know at all; we use cert-manager. On request, Adrian Ludwin took a look at my issues, and it seems to work for him.
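To make the remark about timeouts concrete, here is a hedged sketch of what the probe parameters could look like, expressed with the k8s.io/api core/v1 types (assuming a recent version where the embedded handler field is named ProbeHandler). The port, paths, and all numeric values are illustrative assumptions, not the defaults this PR ships.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Helper for an HTTP probe against the manager's health endpoint
	// (port 8081 and the paths are assumptions, not taken from this PR).
	httpProbe := func(path string) corev1.ProbeHandler {
		return corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: path, Port: intstr.FromInt(8081)},
		}
	}

	// A startup probe with a generous failure threshold gives HNC time to
	// warm up before the liveness probe is allowed to restart the container.
	startup := corev1.Probe{
		ProbeHandler:     httpProbe("/healthz"),
		PeriodSeconds:    5,
		FailureThreshold: 30, // up to ~150s of startup time
	}
	liveness := corev1.Probe{
		ProbeHandler:     httpProbe("/healthz"),
		PeriodSeconds:    10,
		FailureThreshold: 3,
	}
	readiness := corev1.Probe{
		ProbeHandler:     httpProbe("/readyz"),
		PeriodSeconds:    5,
		FailureThreshold: 3,
	}

	fmt.Printf("startup=%+v\nliveness=%+v\nreadiness=%+v\n", startup, liveness, readiness)
}
```

In the actual manifest these would map onto the container's startupProbe, livenessProbe, and readinessProbe fields; the key design point is the startup probe's high failure threshold, which keeps the liveness probe from restarting HNC while it is still warming up.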
/lgtm
/approve
I'm comfortable with this given my testing results.
Checking with @rjbez17 to see if he wants to review, otherwise I'll remove the hold later today.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianludwin, erikgb

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Ryan appears to be away, so let's just put this in.

/hold cancel
This adds liveness and startup/readiness probes to the HNC Deployment using functionality recently added to controller-runtime. The default probe configuration (timeouts etc.) might need some adjustment to work without issues for a typical HNC deployment, and should be tested in a live cluster, which I do not have available.

Testing: Ran all tests successfully on my local workstation when using the simple ping probe. But when switching to the deeper webhook probe, the container never becomes ready. It seems to be an issue related to cert-rotator, which I do not know at all; we use cert-manager. HELP!
Fixes #68
/kind feature
/cc @adrianludwin