
Ingesters failing to leave the ring in GKE #4467


Closed

andrejbranch opened this issue Sep 8, 2021 · 2 comments


@andrejbranch

Describe the bug
I'm seeing an issue where ingesters sometimes fail to leave the ring. This happens no matter which KV store is used. It looks as though there is a race condition between closing the lifecycler loop and leaving the ring. Below are example logs using etcd as the KV store.

cortex-ingester-5 cortex level=info ts=2021-09-08T17:47:21.918300489Z caller=lifecycler.go:754 msg="changing instance state from" old_state=ACTIVE new_state=LEAVING ring=ingester
cortex-ingester-5 cortex {"level":"warn","ts":"2021-09-08T17:47:42.803Z","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008901c0/#initially=[cortex-etcd-0.cortex-etcd:2379;cortex-etcd-1.cortex-etcd:2379;cortex-etcd-2.cortex-etcd:2379]","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}

To Reproduce
Steps to reproduce the behavior:
I've been able to reproduce this starting from a completely blank deployment: spin up some ingesters and connect them to the ring, then do a rolling restart on them. Everything looks good; every ingester leaves the ring and rejoins properly. After that rolling restart is done, do another rolling restart and some ingesters fail to leave the ring. It doesn't matter whether I use memberlist, etcd, or consul.

Expected behavior
Ingesters should leave the ring no matter how many times they are restarted when unregister-on-shutdown is true.
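
For reference, this is roughly where that is configured, a minimal sketch assuming the Cortex YAML layout where the lifecycler exposes unregister_on_shutdown (the -ingester.unregister-on-shutdown flag); the exact nesting may differ in your version:

  ingester:
    lifecycler:
      # When true, the ingester should remove itself from the ring on shutdown.
      unregister_on_shutdown: true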

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: N/A

Storage Engine

  • Blocks
  • Chunks

Additional Context
I found this bug while testing a lower replication factor. I suspect most deployments miss it because a replication factor of 3 with extending writes hides the issue. With a lower replication factor, if an ingester fails to leave the ring, all writes fail.
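
For context, a minimal sketch of what I mean by a lower replication factor, assuming the usual Cortex ring keys (replication_factor, i.e. -distributor.replication-factor); the paths are illustrative:

  ingester:
    lifecycler:
      ring:
        # With a replication factor of 1, a stale ring entry from an ingester
        # that failed to leave immediately turns into write failures.
        replication_factor: 1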

@bboreham
Contributor

Those messages come from the etcd library:

"retrying of unary invoker failed",

return s.Code() == codes.Canceled || s.Message() == "transport is closing"

Can you post the logs leading up to that point, so we can figure out how it gets into that state?

@andrejbranch
Author

andrejbranch commented Sep 10, 2021

So I'm closing this issue because it was not a Cortex problem at all.

Terminating pods were instantly losing their connections to other pods, which looked like a GKE issue. I opened a case with them and dug deeper into kubelet and containerd, but found nothing interesting.

Finally I found that GKE doesn't seem to handle named target ports correctly. The headless service that was breaking in GKE looked like this:

  ports:
    - port: 7946
      protocol: TCP
      name: memberlist-port
      targetPort: memberlist-port
    - port: 7946
      protocol: UDP
      name: udp-port
      targetPort: udp-port

When looking at the GKE UI details of the headless service, I noticed the target port here being 0:

[Screenshot: GKE UI showing the service's target port as 0]

After replacing the named target ports with numeric ports, like this:

  clusterIP: None
  ports:
    - port: 7946
      protocol: TCP
      name: memberlist-port
      targetPort: 7946
    - port: 7946
      protocol: UDP
      name: udp-port
      targetPort: 7946

the UI changed to this:

[Screenshot: GKE UI details of the service after the change, with the target port now populated]

And the issue no longer occurs! I have followed up with GCP to see whether this is expected behavior or not.

This issue would happen for any KV store behind a headless service with named target ports in GKE.
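
For anyone hitting the same thing, a minimal sketch of how the named target port is supposed to resolve: Kubernetes matches the service's targetPort name against a named containerPort on the backing pods, so the pod spec needs something like this (the container name here is illustrative):

  containers:
    - name: cortex
      ports:
        # The service's targetPort: memberlist-port should resolve to this
        # named container port; in GKE it was resolving to 0 instead.
        - containerPort: 7946
          protocol: TCP
          name: memberlist-port
        - containerPort: 7946
          protocol: UDP
          name: udp-port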

@andrejbranch andrejbranch changed the title Ingesters failing to leave the ring Ingesters failing to leave the ring in GKE Sep 10, 2021