Add crio client timeout #3308

Merged: 3 commits merged into google:master on May 25, 2023

Conversation

VicThomas-Medallia
Contributor

Running cAdvisor in our Kubernetes clusters that use the CRI-O container runtime, we observed that for a small number of nodes, the cAdvisor pod's cAdvisor container was in a crashloop due to startup probe failure. Upon remote debugging of such containers, we found that they were stuck doing an HTTP GET via crio.sock for a given container, in the following stack trace:

runtime.gopark(proc.go:382)
runtime.selectgo(select.go:327)
net/http.(*persistConn).roundTrip(transport.go:2638)
net/http.(*Transport).roundTrip(transport.go:603)
net/http.(*Transport).RoundTrip(roundtrip.go:17)
net/http.send(client.go:252)
net/http.(*Client).send(client.go:176)
net/http.(*Client).do(client.go:716)
net/http.(*Client).Do(client.go:582)
github.com/google/cadvisor/container/crio.(*crioClientImpl).ContainerInfo(client.go:136)
github.com/google/cadvisor/container/crio.newCrioContainerHandler(handler.go:109)
github.com/google/cadvisor/container/crio.(*crioFactory).NewContainerHandler(factory.go:75)
github.com/google/cadvisor/container.NewContainerHandler(factory.go:257)
github.com/google/cadvisor/manager.(*manager).createContainerLocked(manager.go:913)
github.com/google/cadvisor/manager.(*manager).createContainer(manager.go:900)
github.com/google/cadvisor/manager.(*manager).detectSubcontainers(manager.go:1104)
github.com/google/cadvisor/manager.(*manager).Start(manager.go:300)
main.main(cadvisor.go:166)
runtime.main(proc.go:250)
runtime.goexit(asm_amd64.s:1598)
runtime.newproc(<autogenerated>:1)

It turned out that the containers for which the HTTP GET via crio.sock was hanging belonged to pods stuck in Terminating. A co-worker reproduced the hang using crictl and submitted this issue for CRI-O.
However, another way to avoid this problem is to allow for a timeout for the crio client. That is what this pull request does.

We are currently running a fork containing the code in this pull request in our Kubernetes clusters. Using a new --crio_client_timeout flag to specify a timeout, it successfully bypasses the problem, allowing iteration over the set of detected containers to continue.

The default behavior -- when no --crio_client_timeout flag is used -- remains as is. That is, the default behavior continues to be no timeout.

@google-cla

google-cla bot commented May 9, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@k8s-ci-robot
Collaborator

Hi @VicThomas-Medallia. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@VicThomas-Medallia VicThomas-Medallia marked this pull request as ready for review May 11, 2023 14:00
@VicThomas-Medallia
Contributor Author

/assign

@VicThomas-Medallia
Contributor Author

It is good news that CRI-O has been fixed! Nonetheless, I still think this pull request has merit, both as a defensive protection against any similar issue occurring in the future and as a way for those who cannot upgrade their CRI-O version on a timely basis to avoid the problem now.

@VicThomas-Medallia
Contributor Author

@bobbypage - per these instructions, I will assign this pull request to you, given that you were mentioned in Slack and that you are the most active contributor in the past year. I hope that I'm following the appropriate rules.

@VicThomas-Medallia
Contributor Author

/assign @bobbypage

@SergeyKanzhelev
Collaborator

@rphillips can you please review?

@SergeyKanzhelev
Collaborator

/ok-to-test

@rphillips
Contributor

/lgtm

There should be no change of behavior by default.

@VicThomas-Medallia
Contributor Author

Thank you @SergeyKanzhelev and @rphillips.

Collaborator

@SergeyKanzhelev SergeyKanzhelev left a comment


/lgtm

@SergeyKanzhelev SergeyKanzhelev merged commit 137032c into google:master May 25, 2023
@VicThomas-Medallia VicThomas-Medallia deleted the add-crio-client-timeout branch May 26, 2023 20:19