Add crio client timeout #3308

Merged: 3 commits merged into google:master on May 25, 2023

Conversation

VicThomas-Medallia
Contributor

Running cAdvisor in our Kubernetes clusters that use the CRI-O container runtime, we observed that for a small number of nodes, the cAdvisor pod's cAdvisor container was in a crashloop due to startup probe failure. Upon remote debugging of such containers, we found that they were stuck doing an HTTP GET via crio.sock for a given container, in the following stack trace:

runtime.gopark(proc.go:382)
runtime.selectgo(select.go:327)
net/http.(*persistConn).roundTrip(transport.go:2638)
net/http.(*Transport).roundTrip(transport.go:603)
net/http.(*Transport).RoundTrip(roundtrip.go:17)
net/http.send(client.go:252)
net/http.(*Client).send(client.go:176)
net/http.(*Client).do(client.go:716)
net/http.(*Client).Do(client.go:582)
github.com/google/cadvisor/container/crio.(*crioClientImpl).ContainerInfo(client.go:136)
github.com/google/cadvisor/container/crio.newCrioContainerHandler(handler.go:109)
github.com/google/cadvisor/container/crio.(*crioFactory).NewContainerHandler(factory.go:75)
github.com/google/cadvisor/container.NewContainerHandler(factory.go:257)
github.com/google/cadvisor/manager.(*manager).createContainerLocked(manager.go:913)
github.com/google/cadvisor/manager.(*manager).createContainer(manager.go:900)
github.com/google/cadvisor/manager.(*manager).detectSubcontainers(manager.go:1104)
github.com/google/cadvisor/manager.(*manager).Start(manager.go:300)
main.main(cadvisor.go:166)
runtime.main(proc.go:250)
runtime.goexit(asm_amd64.s:1598)
runtime.newproc(<autogenerated>:1)

It turned out that the containers for which the HTTP GET via crio.sock was hanging belonged to pods stuck in Terminating. A co-worker reproduced the hang using crictl and submitted this issue for CRI-O.
However, another way to avoid this problem is to allow for a timeout for the crio client. That is what this pull request does.

We are currently running a fork containing the code in this pull request in our Kubernetes clusters. Using a new --crio_client_timeout flag to specify a timeout, it successfully bypasses the problem, allowing iteration over the set of detected containers to continue.

The default behavior -- when no --crio_client_timeout flag is used -- remains as is. That is, the default behavior continues to be no timeout.

@google-cla

google-cla bot commented May 9, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@k8s-ci-robot
Collaborator

Hi @VicThomas-Medallia. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@VicThomas-Medallia VicThomas-Medallia marked this pull request as ready for review May 11, 2023 14:00
@VicThomas-Medallia
Contributor Author

/assign

@VicThomas-Medallia
Contributor Author

It is good news that CRI-O has been fixed! Nonetheless, I still think this pull request has merit, both as a defensive protection against any similar issue occurring in the future and as a way for those who cannot upgrade their CRI-O version on a timely basis to avoid the problem now.

@VicThomas-Medallia
Contributor Author

@bobbypage - per these instructions, I will assign this pull request to you, given that you were mentioned in Slack and that you are the most active contributor in the past year. I hope that I'm following the appropriate rules.

@VicThomas-Medallia
Contributor Author

/assign @bobbypage

@SergeyKanzhelev
Collaborator

@rphillips can you please review?

@SergeyKanzhelev
Collaborator

/ok-to-test

@rphillips
Contributor

/lgtm

There should be no change of behavior by default.

@VicThomas-Medallia
Contributor Author

Thank you @SergeyKanzhelev and @rphillips.

Collaborator

@SergeyKanzhelev SergeyKanzhelev left a comment


/lgtm

@SergeyKanzhelev SergeyKanzhelev merged commit 137032c into google:master May 25, 2023
@VicThomas-Medallia VicThomas-Medallia deleted the add-crio-client-timeout branch May 26, 2023 20:19