
bug: Meta pod does not restart if it loses leadership #8238

Closed
CAJan93 opened this issue Feb 28, 2023 · 11 comments

@CAJan93
Contributor

CAJan93 commented Feb 28, 2023

Describe the bug

Meta pod does not restart if it loses leadership. Instead, the meta pod stays in an unknown state.

2023-02-28T14:23:52.460829Z ERROR risingwave_meta::rpc::election_client: keep alive failed, stopping main loop
2023-02-28T14:23:52.460848Z  WARN risingwave_meta::rpc::election_client: client testinstance-meta-default-1.testinstance-meta:5690 lost leadership

When looking at the MembersResponse of the meta pod, I get

address:{host:"testinstance-meta-default-2.testinstance-meta"  port:5690}  is_leader:true
address:{host:"testinstance-meta-default-0.testinstance-meta"  port:5690}]

testinstance-meta-default-1, the former leader, is not listed.

To Reproduce

  • Start a setup with multiple meta nodes
  • Force delete the etcd pod
  • The leader will lose leadership and stay in an unknown state

Expected behavior

My expectation would have been that fencing kicks in once the pod loses leadership. The pod can then restart as a follower.
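
As a rough sketch of what I mean (illustrative only, not the actual RisingWave code): once leadership is lost, the process exits with a non-zero status, the container's restart policy brings the pod back up, and it rejoins the election as a follower.

// Minimal fencing sketch; the function name and behavior are assumptions.
fn on_lost_leadership() -> ! {
    eprintln!("lost leadership, exiting so the pod can restart as a follower");
    std::process::exit(1);
}

fn main() {
    // In the real meta node this would be triggered by the election client.
    on_lost_leadership();
}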

Additional context

Used image: nightly-20230227

@arkbriar We could also do the fencing in the operator, but I believe doing it here will be quicker.

CC @shanicky

@CAJan93 CAJan93 added the type/bug Type: Bug. Only for issues. label Feb 28, 2023
@github-actions github-actions bot added this to the release-0.1.18 milestone Feb 28, 2023
@lmatz
Contributor

lmatz commented Feb 28, 2023

I am curious under what situations the etcd pod gets deleted/killed in real-world cases.

@shanicky
Contributor

It seems to be a problem caused by meta not being able to shut down gracefully: the election client already thinks it has lost leadership and has entered the lost-leader process, but the whole program can't exit for some reason.

@shanicky
Contributor

shanicky commented Feb 28, 2023

I am curious under what situations the etcd pod gets deleted/killed

If the connection to etcd is lost, meta's election client will attempt to rebuild the connection within ttl/2 time after the last successful keep-alive.

If the connection is successfully rebuilt:

  • If the data in etcd is still present and current meta is still the leader, the leader lease will continue to be kept alive.
  • If the data in etcd is still present, but meta loses leadership after reconnecting, it will exit proactively.
  • If etcd has been cleared, the leader lease will expire and meta will lose leadership and exit proactively.

These three branches are part of a Tokio select.

If the connection attempt fails (e.g., due to etcd being down) or if there is a master switch in a multi-node etcd cluster, meta will attempt to reconnect within ttl/2 time. If the connection timeout is exceeded, meta will assume it has lost leadership and will exit proactively.
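
A minimal sketch of such a keep-alive loop (illustrative only, with hypothetical helper names and assuming the tokio crate with the time and macros features; not the actual election client code):

use std::time::Duration;
use tokio::time::{interval, timeout};

// Hypothetical stand-ins for the election client's etcd calls.
async fn renew_lease() -> Result<bool, ()> {
    Ok(true)
}

async fn observe_leader_change() {
    // In the real client this would be a watch on the election key in etcd.
    std::future::pending::<()>().await
}

async fn keep_alive_loop(ttl: u64) {
    let mut tick = interval(Duration::from_secs(ttl / 2));
    loop {
        tokio::select! {
            // Periodically renew the lease; a renewal that reports "no longer
            // leader", fails, or times out ends the loop.
            _ = tick.tick() => {
                match timeout(Duration::from_secs(ttl / 2), renew_lease()).await {
                    Ok(Ok(true)) => {}                            // still leader
                    Ok(Ok(false)) | Ok(Err(_)) | Err(_) => break, // lost leadership / etcd cleared / timed out
                }
            }
            // A watch notices that another node became leader.
            _ = observe_leader_change() => break,
        }
    }
    // Leaving the loop means leadership is lost; the process should then exit
    // so the pod can come back as a follower.
}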

@shanicky shanicky self-assigned this Feb 28, 2023
@shanicky
Contributor

shanicky commented Mar 1, 2023

When etcd is killed, other managers may still be writing to etcd. In that case, they will keep retrying and reconnecting to etcd within a certain timeout period. This can leave the election client believing that leadership has been lost while the program has not yet exited.
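
A self-contained sketch of this failure mode (all names are hypothetical, not the actual meta shutdown code): a task stuck retrying against a dead etcd keeps the process alive unless the shutdown path bounds its wait.

use std::time::Duration;
use tokio::time::{sleep, timeout};

// Pretend every write to etcd fails because etcd is gone.
async fn try_write() -> Result<(), ()> {
    Err(())
}

// A manager task that keeps retrying and never gives up.
async fn write_with_retry() {
    while try_write().await.is_err() {
        sleep(Duration::from_secs(1)).await;
    }
}

#[tokio::main]
async fn main() {
    let manager = tokio::spawn(write_with_retry());

    // The election client has already decided that leadership is lost ...
    eprintln!("lost leadership, shutting down");

    // ... but a shutdown that simply awaits the stuck task would hang forever,
    // so the process never exits and the pod stays "unknown". Bounding the
    // wait (or aborting the task) lets the process exit anyway.
    if timeout(Duration::from_secs(5), manager).await.is_err() {
        eprintln!("manager task still busy after 5s, exiting anyway");
    }
}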

@shanicky
Contributor

shanicky commented Mar 1, 2023

Could you provide a more complete pod distribution, such as the output of 'kubectl get pod'? @CAJan93

@CAJan93
Contributor Author

CAJan93 commented Mar 1, 2023

I am curious under what situations the etcd pod gets deleted/killed in real-world cases

We will probably never have that exact scenario, but a network issue between the meta leader pod and the etcd pod would be enough to trigger it.

Also, even if it is unlikely, I think we should prepare for the case where ETCD actually crashes.

@CAJan93
Contributor Author

CAJan93 commented Mar 1, 2023

If the connection to etcd is lost, meta's election client will attempt to rebuild the connection within ttl/2 time after the last successful keep-alive.

@shanicky Is ttl equivalent to meta_leader_lease_secs?

It seems to be a problem caused by meta not being able to shut down gracefully

Maybe it is also because the key in ETCD is gone? Maybe we would not have this issue if we crashed ETCD and then re-populated the new ETCD instance with the old keys?

@shanicky Please let me know if I misunderstand the lease below or if you think that this is a non-issue.

I ran the experiment below twice with the same outcome:

Experiment

I am only using etcd and meta in my RW resource to simplify things here. The image is nightly-20230207.

Before the ETCD incident

k get pods -A                     

NAMESPACE                    NAME                                                      READY   STATUS    RESTARTS        AGE
cert-manager                 cert-manager-cainjector-857ff8f7cb-lbrp9                  1/1     Running   0               3m32s
cert-manager                 cert-manager-d58554549-lbbrg                              1/1     Running   0               3m32s
cert-manager                 cert-manager-webhook-76fdf7c485-rckq5                     1/1     Running   0               3m32s
default                      rwproxy-5b76f598f4-g78qk                                  1/1     Running   0               5m3s
kube-system                  coredns-565d847f94-fv949                                  1/1     Running   0               5m26s
kube-system                  coredns-565d847f94-nqbfx                                  1/1     Running   0               5m26s
kube-system                  etcd-onebox-control-plane                                 1/1     Running   0               5m42s
kube-system                  kindnet-gt6r6                                             1/1     Running   0               5m26s
kube-system                  kube-apiserver-onebox-control-plane                       1/1     Running   0               5m40s
kube-system                  kube-controller-manager-onebox-control-plane              1/1     Running   0               5m40s
kube-system                  kube-proxy-hlpqg                                          1/1     Running   0               5m26s
kube-system                  kube-scheduler-onebox-control-plane                       1/1     Running   0               5m40s
local-path-storage           local-path-provisioner-684f458cdd-b6sfx                   1/1     Running   0               5m26s
risingwave-operator-system   risingwave-operator-controller-manager-58c5798c56-b9v56   2/2     Running   0               2m26s
rwc-2-mytenant               risingwave-etcd-mytenant-0                                1/1     Running   1 (2m32s ago)   2m46s
rwc-2-mytenant               risingwave-meta-default-0                                 1/1     Running   0               2m6s
rwc-2-mytenant               risingwave-meta-default-1                                 1/1     Running   0               37s
rwc-2-mytenant               risingwave-meta-default-2                                 1/1     Running   0               37s

I am using this RW operator. It adds a role label to the meta pods.

Status at the beginning of the experiment

k get pods -n rwc-2-mytenant -L risingwave/meta-role -l risingwave/component=meta
NAME                        READY   STATUS    RESTARTS   AGE     META-ROLE
risingwave-meta-default-0   1/1     Running   0          5m28s   leader
risingwave-meta-default-1   1/1     Running   0          3m59s   follower
risingwave-meta-default-2   1/1     Running   0          3m59s   follower

Let's confirm that using the logs:

for i in {0..2}; do echo meta-$i; k logs risingwave-meta-default-${i}  | grep "Defining leader services"; done
meta-0
2023-03-01T09:57:16.595646Z  INFO risingwave_meta::rpc::server: Defining leader services
meta-1
meta-2

Let's observe the change in the status of the meta pods using k get pods -n rwc-2-mytenant -L risingwave/meta-role -l risingwave/component=meta --watch

Is the ETCD content odd?

What does ETCD say about who the leader is right now?

k port-forward svc/risingwave-etcd-mytenant 2379 &
etcdctl --endpoints=127.0.0.1:2379 get _ --prefix=true --write-out="fields"
"Key" : "__meta_election_/41b5869c9a0e221d"
"CreateRevision" : 2
"ModRevision" : 2
"Version" : 1
"Value" : "risingwave-meta-default-0.risingwave-meta:5690"
"Lease" : 4734838590407582237

"Key" : "__meta_election_/41b5869c9a0e2327"
"CreateRevision" : 108
"ModRevision" : 108
"Version" : 1
"Value" : "risingwave-meta-default-1.risingwave-meta:5690"
"Lease" : 4734838590407582503

"Key" : "__meta_election_/41b5869c9a0e2332"
"CreateRevision" : 111
"ModRevision" : 111
"Version" : 1
"Value" : "risingwave-meta-default-2.risingwave-meta:5690"
"Lease" : 4734838590407582514

If we look at the logs of our meta pods, we see meta_leader_lease_secs: 10, so "Lease" : 4734838590407582514 looks a bit odd to me, since interpreted as a timestamp it would be 97 years from now (see this timestamp converter). Do I misunderstand the meaning of lease? Is this not the time at which the value expires? Or am I looking at the wrong key here? Is it not __meta_election_?

ETCD incident and its effects on meta

Since our lease never expires, simply blocking the network between the leader pod and ETCD does not help. Let's simulate lease expiration by deleting the current leader's election key:

etcdctl --endpoints=127.0.0.1:2379 del __meta_election_/41b5869c9a0e221d
etcdctl --endpoints=127.0.0.1:2379 get _ --prefix=true --write-out="fields"
"Key" : "__meta_election_/41b5869c9a0e2327"
"CreateRevision" : 108
"ModRevision" : 108
"Version" : 1
"Value" : "risingwave-meta-default-1.risingwave-meta:5690"
"Lease" : 4734838590407582503

"Key" : "__meta_election_/41b5869c9a0e2332"
"CreateRevision" : 111
"ModRevision" : 111
"Version" : 1
"Value" : "risingwave-meta-default-2.risingwave-meta:5690"
"Lease" : 4734838590407582514

The former leader goes into an unknown state and meta-1 is elected as leader. meta-0 does not restart:

k get pods -n rwc-2-mytenant -L risingwave/meta-role -l risingwave/component=meta --watch
NAME                        READY   STATUS    RESTARTS   AGE     META-ROLE
risingwave-meta-default-0   1/1     Running   0          10m     leader
risingwave-meta-default-1   1/1     Running   0          8m43s   follower
risingwave-meta-default-2   1/1     Running   0          8m43s   follower
risingwave-meta-default-0   1/1     Running   0          54m     unknown
risingwave-meta-default-1   1/1     Running   0          52m     leader

# meta-1 is the actual leader according to the logs
k logs risingwave-meta-default-1 | grep "Defining leader services"
2023-03-01T09:57:16.595646Z  INFO risingwave_meta::rpc::server: Defining leader services

# original leader meta-0 stuck 
k logs risingwave-meta-default-0 | tail
2023-03-01T10:03:48.949202Z  INFO risingwave_meta::hummock::compactor_manager: Removed compactor session 2
2023-03-01T10:03:48.953430Z  INFO risingwave_meta::hummock::manager::worker: Released hummock context 2
2023-03-01T10:03:49.949167Z  WARN risingwave_meta::manager::cluster: Deleted expired worker 1 HostAddress {
    host: "10.244.0.20",
    port: 6660,
}, current timestamp 1677665029
2023-03-01T10:03:49.949202Z  INFO risingwave_meta::hummock::compactor_manager: Removed compactor session 1
2023-03-01T10:03:49.951419Z  INFO risingwave_meta::hummock::manager::worker: Released hummock context 1
2023-03-01T10:50:53.755999Z  WARN risingwave_meta::rpc::election_client: leader has been changed to risingwave-meta-default-1.risingwave-meta:5690
2023-03-01T10:50:53.756212Z  WARN risingwave_meta::rpc::election_client: client risingwave-meta-default-0.risingwave-meta:5690 lost leadership

Status after about 5 to 10 minutes:

etcdctl --endpoints=127.0.0.1:2379 get _ --prefix=true --write-out="fields"
"Key" : "__meta_election_/41b5869c9a0e2327"
"CreateRevision" : 108
"ModRevision" : 108
"Version" : 1
"Value" : "risingwave-meta-default-1.risingwave-meta:5690"
"Lease" : 4734838590407582503

"Key" : "__meta_election_/41b5869c9a0e2332"
"CreateRevision" : 111
"ModRevision" : 111
"Version" : 1
"Value" : "risingwave-meta-default-2.risingwave-meta:5690"
"Lease" : 4734838590407582514

k get pods -n rwc-2-mytenant -L risingwave/meta-role -l risingwave/component=meta
NAME                        READY   STATUS    RESTARTS   AGE   META-ROLE
risingwave-meta-default-0   1/1     Running   0          62m   unknown
risingwave-meta-default-1   1/1     Running   0          60m   leader
risingwave-meta-default-2   1/1     Running   0          60m   follower

k logs risingwave-meta-default-0 | tail -1
2023-03-01T10:50:53.756212Z  WARN risingwave_meta::rpc::election_client: client risingwave-meta-default-0.risingwave-meta:5690 lost leadership

Expected behaviour

I would have expected the following to happen:

If etcd has been cleared, the leader lease will expire and meta will lose leadership and exit proactively.

@CAJan93
Contributor Author

CAJan93 commented Mar 1, 2023

Okay, I did misunderstand the meaning of lease. The timeout of the lease seems to be correct:

etcdctl lease list
found 2 leases
41b5869cf076776f
41b5869cf0767775

etcdctl lease timetolive 41b5869cf076776f
lease 41b5869cf076776f granted with TTL(10s), remaining(9s)

etcdctl lease timetolive 41b5869cf0767775
lease 41b5869cf0767775 granted with TTL(10s), remaining(9s)

@shanicky
Contributor

You can directly use the command-line tool of the election module to check the leader. 😃

@shanicky
Contributor

Please retry the logic after #8734.

@CAJan93
Contributor Author

CAJan93 commented Mar 23, 2023

Tried it using your PR. Worked really well.

k get pods  -L risingwave/meta-role -l risingwave/component=meta --watch
NAME                        READY   STATUS    RESTARTS   AGE     META-ROLE
risingwave-meta-default-0   1/1     Running   0          3m43s   leader
risingwave-meta-default-1   1/1     Running   0          5m10s   follower
# delete lease here
risingwave-meta-default-0   0/1     Completed   0          5m4s    leader
risingwave-meta-default-0   0/1     Completed   0          5m4s    unknown
risingwave-meta-default-0   0/1     Running     1 (1s ago)   5m5s    unknown
risingwave-meta-default-1   1/1     Running     0            6m32s   leader
risingwave-meta-default-0   0/1     Running     1 (1s ago)   5m5s    follower
risingwave-meta-default-0   1/1     Running     1 (2s ago)   5m6s    follower

Let's try with the graceful shutdown.

@CAJan93 CAJan93 closed this as completed Mar 23, 2023