Add crio client timeout #3308
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
Hi @VicThomas-Medallia. Thanks for your PR. I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
It is good news that CRI-O has been fixed! Nonetheless, I still think this pull request has merit, both as a defensive measure against any similar issue occurring in the future and as a way for those who cannot upgrade their CRI-O version promptly to avoid the problem now.
@bobbypage - per these instructions, I will assign this pull request to you, given that you were mentioned in Slack and that you are the most active contributor in the past year. I hope that I'm following the appropriate rules.
/assign @bobbypage
@rphillips can you please review?
/ok-to-test
/lgtm
There should be no change of behavior by default.
Thank you @SergeyKanzhelev and @rphillips.
/lgtm
While running cAdvisor in our Kubernetes clusters that use the CRI-O container runtime, we observed that on a small number of nodes the cAdvisor container in the cAdvisor pod was in a crash loop because its startup probe failed. Remote debugging showed that these containers were stuck performing an HTTP GET over crio.sock for a particular container, with the following stack trace:
It turned out that the containers for which the HTTP GET over crio.sock hung belonged to pods stuck in Terminating. A co-worker reproduced the hang using `crictl` and submitted an issue against CRI-O. However, another way to avoid this problem is to allow a timeout for the CRI-O client, which is what this pull request does.
We are currently running a fork containing the code in this pull request in our Kubernetes clusters. When a timeout is specified with the new `--crio_client_timeout` flag, the hung request is abandoned, which bypasses the problem and allows iteration over the set of detected containers to continue. The default behavior, when the `--crio_client_timeout` flag is not used, remains unchanged: there is no timeout.