
Ingesters failing to leave the ring in GKE #4467


Closed

andrejbranch opened this issue Sep 8, 2021 · 2 comments


@andrejbranch

Describe the bug
I'm seeing an issue where ingesters sometimes fail to leave the ring. This happens no matter which KV store is used. It looks as though there is a race condition between closing the lifecycler loop and leaving the ring. Below are example logs using etcd as the KV store.

cortex-ingester-5 cortex level=info ts=2021-09-08T17:47:21.918300489Z caller=lifecycler.go:754 msg="changing instance state from" old_state=ACTIVE new_state=LEAVING ring=ingester
cortex-ingester-5 cortex {"level":"warn","ts":"2021-09-08T17:47:42.803Z","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008901c0/#initially=[cortex-etcd-0.cortex-etcd:2379;cortex-etcd-1.cortex-etcd:2379;cortex-etcd-2.cortex-etcd:2379]","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}

To Reproduce
Steps to reproduce the behavior:
I've been able to reproduce this starting from a completely blank deployment: spin up some ingesters and connect them to the ring, then do a rolling restart on them. Everything looks good; every ingester leaves the ring and rejoins properly. After that rolling restart is done, do another rolling restart and some ingesters fail to leave the ring. It doesn't matter whether I use memberlist, etcd, or consul.

Expected behavior
Ingesters should leave the ring no matter how many times they are restarted when unregister-on-shutdown is true.
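
For reference, this is roughly where that is configured, a minimal sketch assuming the Cortex YAML layout where the lifecycler exposes unregister_on_shutdown (the -ingester.unregister-on-shutdown flag); the exact nesting may differ in your version:

  ingester:
    lifecycler:
      # When true, the ingester should remove itself from the ring on shutdown.
      unregister_on_shutdown: true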

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: N/A

Storage Engine

  • Blocks
  • Chunks

Additional Context
I found this bug while testing a lower replication factor. I suspect most deployments miss it because a replication factor of 3 with extending writes hides the issue. With a lower replication factor, if an ingester fails to leave the ring, all writes fail.
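
For context, a minimal sketch of what I mean by a lower replication factor, assuming the usual Cortex ring keys (replication_factor, i.e. -distributor.replication-factor); the paths are illustrative:

  ingester:
    lifecycler:
      ring:
        # With a replication factor of 1, a stale ring entry from an ingester
        # that failed to leave immediately turns into write failures.
        replication_factor: 1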

@bboreham
Contributor

Those messages come from the etcd library:

"retrying of unary invoker failed",

return s.Code() == codes.Canceled || s.Message() == "transport is closing"

Can you post the logs leading up to that point, so we can figure out how it gets into that state?

@andrejbranch
Author

andrejbranch commented Sep 10, 2021

So I'm closing this issue because it was not a Cortex problem at all.

Terminating pods were instantly losing their connections to other pods, which looked like a GKE issue. I opened a case with them and dug deeper into kubelet and containerd, but found nothing interesting.

Finally I found that GKE doesn't seem to handle named target ports correctly. The headless service that was breaking in GKE looked like this:

  ports:
    - port: 7946
      protocol: TCP
      name: memberlist-port
      targetPort: memberlist-port
    - port: 7946
      protocol: UDP
      name: udp-port
      targetPort: udp-port

When looking at the GKE UI details of the headless service, I noticed the target port here being 0:

[Screenshot: GKE UI showing the service's target port as 0]

After replacing the named target ports with numeric ports, like this:

  clusterIP: None
  ports:
    - port: 7946
      protocol: TCP
      name: memberlist-port
      targetPort: 7946
    - port: 7946
      protocol: UDP
      name: udp-port
      targetPort: 7946

the UI changed to this:

[Screenshot: GKE UI details of the service after the change, with the target port now populated]

And the issue no longer occurs! I have followed up with GCP to see whether this is expected behavior or not.

This issue would happen for any KV store behind a headless service with named target ports in GKE.
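
For anyone hitting the same thing, a minimal sketch of how the named target port is supposed to resolve: Kubernetes matches the service's targetPort name against a named containerPort on the backing pods, so the pod spec needs something like this (the container name here is illustrative):

  containers:
    - name: cortex
      ports:
        # The service's targetPort: memberlist-port should resolve to this
        # named container port; in GKE it was resolving to 0 instead.
        - containerPort: 7946
          protocol: TCP
          name: memberlist-port
        - containerPort: 7946
          protocol: UDP
          name: udp-port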

@andrejbranch andrejbranch changed the title Ingesters failing to leave the ring Ingesters failing to leave the ring in GKE Sep 10, 2021