
bug: Meta pod does not restart if it loses leadership #8238

Closed
CAJan93 opened this issue Feb 28, 2023 · 11 comments

@CAJan93
Contributor

CAJan93 commented Feb 28, 2023

Describe the bug

Meta pod does not restart if it loses leadership. Instead, the meta pod stays in an unknown state.

2023-02-28T14:23:52.460829Z ERROR risingwave_meta::rpc::election_client: keep alive failed, stopping main loop
2023-02-28T14:23:52.460848Z  WARN risingwave_meta::rpc::election_client: client testinstance-meta-default-1.testinstance-meta:5690 lost leadership

When looking at the MembersResponse of the meta pod, I get

address:{host:"testinstance-meta-default-2.testinstance-meta"  port:5690}  is_leader:true
address:{host:"testinstance-meta-default-0.testinstance-meta"  port:5690}]

testinstance-meta-default-1, the former leader, is not listed.

To Reproduce

  • Start a setup with multiple meta nodes
  • Force delete the etcd pod
  • The leader will lose leadership and stay in an unknown state

Expected behavior

My expectation would have been that fencing kicks in once the pod loses leadership. The pod can then restart as a follower.
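
As a rough sketch of what I mean (illustrative only, not the actual RisingWave code): once leadership is lost, the process exits with a non-zero status, the container's restart policy brings the pod back up, and it rejoins the election as a follower.

// Minimal fencing sketch; the function name and behavior are assumptions.
fn on_lost_leadership() -> ! {
    eprintln!("lost leadership, exiting so the pod can restart as a follower");
    std::process::exit(1);
}

fn main() {
    // In the real meta node this would be triggered by the election client.
    on_lost_leadership();
}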

Additional context

Used image: nightly-20230227

@arkbriar We could also do the fencing in the operator, but I believe doing it here will be quicker.

CC @shanicky

@CAJan93 CAJan93 added the type/bug Type: Bug. Only for issues. label Feb 28, 2023
@github-actions github-actions bot added this to the release-0.1.18 milestone Feb 28, 2023
@lmatz
Contributor

lmatz commented Feb 28, 2023

I am curious under what situations the etcd pod gets deleted/killed in real-world cases.

@shanicky
Contributor

It seems to be a problem caused by meta not being able to shut down gracefully: the election client already thinks it has lost leadership and has entered the lost-leader process, but the whole program can't exit for some reason.

@shanicky
Contributor

shanicky commented Feb 28, 2023

I am curious under what situations the etcd pod gets deleted/killed

If the connection to etcd is lost, meta's election client will attempt to rebuild the connection within ttl/2 time after the last successful keep-alive.

If the connection is successfully rebuilt:

  • If the data in etcd is still present and current meta is still the leader, the leader lease will continue to be kept alive.
  • If the data in etcd is still present, but meta loses leadership after reconnecting, it will exit proactively.
  • If etcd has been cleared, the leader lease will expire and meta will lose leadership and exit proactively.

These three branches are part of a Tokio select.

If the connection attempt fails (e.g., due to etcd being down) or if there is a master switch in a multi-node etcd cluster, meta will attempt to reconnect within ttl/2 time. If the connection timeout is exceeded, meta will assume it has lost leadership and will exit proactively.
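
A minimal sketch of such a keep-alive loop (illustrative only, with hypothetical helper names and assuming the tokio crate with the time and macros features; not the actual election client code):

use std::time::Duration;
use tokio::time::{interval, timeout};

// Hypothetical stand-ins for the election client's etcd calls.
async fn renew_lease() -> Result<bool, ()> {
    Ok(true)
}

async fn observe_leader_change() {
    // In the real client this would be a watch on the election key in etcd.
    std::future::pending::<()>().await
}

async fn keep_alive_loop(ttl: u64) {
    let mut tick = interval(Duration::from_secs(ttl / 2));
    loop {
        tokio::select! {
            // Periodically renew the lease; a renewal that reports "no longer
            // leader", fails, or times out ends the loop.
            _ = tick.tick() => {
                match timeout(Duration::from_secs(ttl / 2), renew_lease()).await {
                    Ok(Ok(true)) => {}                            // still leader
                    Ok(Ok(false)) | Ok(Err(_)) | Err(_) => break, // lost leadership / etcd cleared / timed out
                }
            }
            // A watch notices that another node became leader.
            _ = observe_leader_change() => break,
        }
    }
    // Leaving the loop means leadership is lost; the process should then exit
    // so the pod can come back as a follower.
}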

@shanicky shanicky self-assigned this Feb 28, 2023
@shanicky
Contributor

shanicky commented Mar 1, 2023

When etcd is killed, other managers may still be writing to etcd. In that case, they will keep retrying and reconnecting to etcd within a certain timeout period. This can leave the election client believing that leadership has been lost while the program has not yet exited.
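
A self-contained sketch of this failure mode (all names are hypothetical, not the actual meta shutdown code): a task stuck retrying against a dead etcd keeps the process alive unless the shutdown path bounds its wait.

use std::time::Duration;
use tokio::time::{sleep, timeout};

// Pretend every write to etcd fails because etcd is gone.
async fn try_write() -> Result<(), ()> {
    Err(())
}

// A manager task that keeps retrying and never gives up.
async fn write_with_retry() {
    while try_write().await.is_err() {
        sleep(Duration::from_secs(1)).await;
    }
}

#[tokio::main]
async fn main() {
    let manager = tokio::spawn(write_with_retry());

    // The election client has already decided that leadership is lost ...
    eprintln!("lost leadership, shutting down");

    // ... but a shutdown that simply awaits the stuck task would hang forever,
    // so the process never exits and the pod stays "unknown". Bounding the
    // wait (or aborting the task) lets the process exit anyway.
    if timeout(Duration::from_secs(5), manager).await.is_err() {
        eprintln!("manager task still busy after 5s, exiting anyway");
    }
}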

@shanicky
Contributor

shanicky commented Mar 1, 2023

Could you provide a more complete pod distribution, such as the output of 'kubectl get pod'? @CAJan93

@CAJan93
Contributor Author

CAJan93 commented Mar 1, 2023

I am curious under what situations the etcd pod gets deleted/killed in real-world cases

We will probably never have that exact scenario, but a network issue between the meta leader pod and the etcd pod would be enough to trigger it.

Also, even if it is unlikely, I think we should prepare for the case where ETCD actually crashes.

@CAJan93
Contributor Author

CAJan93 commented Mar 1, 2023

If the connection to etcd is lost, meta's election client will attempt to rebuild the connection within ttl/2 time after the last successful keep-alive.

@shanicky Is ttl equivalent to meta_leader_lease_secs?

It seems to be a problem caused by meta not being able to shut down gracefully

Maybe it is also because the key in ETCD is gone? Maybe we would not have this issue if we crashed ETCD and then re-populated the new ETCD instance with the old keys?

@shanicky Please let me know if I misunderstand the lease below or if you think that this is a non-issue.

I ran the experiment below twice with the same outcome:

Experiment

I am only using etcd and meta in my RW resource to simplify things here. The image is nightly-20230207.

Before the ETCD incident

k get pods -A                     

NAMESPACE                    NAME                                                      READY   STATUS    RESTARTS        AGE
cert-manager                 cert-manager-cainjector-857ff8f7cb-lbrp9                  1/1     Running   0               3m32s
cert-manager                 cert-manager-d58554549-lbbrg                              1/1     Running   0               3m32s
cert-manager                 cert-manager-webhook-76fdf7c485-rckq5                     1/1     Running   0               3m32s
default                      rwproxy-5b76f598f4-g78qk                                  1/1     Running   0               5m3s
kube-system                  coredns-565d847f94-fv949                                  1/1     Running   0               5m26s
kube-system                  coredns-565d847f94-nqbfx                                  1/1     Running   0               5m26s
kube-system                  etcd-onebox-control-plane                                 1/1     Running   0               5m42s
kube-system                  kindnet-gt6r6                                             1/1     Running   0               5m26s
kube-system                  kube-apiserver-onebox-control-plane                       1/1     Running   0               5m40s
kube-system                  kube-controller-manager-onebox-control-plane              1/1     Running   0               5m40s
kube-system                  kube-proxy-hlpqg                                          1/1     Running   0               5m26s
kube-system                  kube-scheduler-onebox-control-plane                       1/1     Running   0               5m40s
local-path-storage           local-path-provisioner-684f458cdd-b6sfx                   1/1     Running   0               5m26s
risingwave-operator-system   risingwave-operator-controller-manager-58c5798c56-b9v56   2/2     Running   0               2m26s
rwc-2-mytenant               risingwave-etcd-mytenant-0                                1/1     Running   1 (2m32s ago)   2m46s
rwc-2-mytenant               risingwave-meta-default-0                                 1/1     Running   0               2m6s
rwc-2-mytenant               risingwave-meta-default-1                                 1/1     Running   0               37s
rwc-2-mytenant               risingwave-meta-default-2                                 1/1     Running   0               37s

I am using this RW operator. It adds a role label to the meta pods.

Status at the beginning of the experiment

k get pods -n rwc-2-mytenant -L risingwave/meta-role -l risingwave/component=meta
NAME                        READY   STATUS    RESTARTS   AGE     META-ROLE
risingwave-meta-default-0   1/1     Running   0          5m28s   leader
risingwave-meta-default-1   1/1     Running   0          3m59s   follower
risingwave-meta-default-2   1/1     Running   0          3m59s   follower

Let's confirm that using the logs:

for i in {0..2}; do echo meta-$i; k logs risingwave-meta-default-${i}  | grep "Defining leader services"; done
meta-0
2023-03-01T09:57:16.595646Z  INFO risingwave_meta::rpc::server: Defining leader services
meta-1
meta-2

Let's observe the change in the status of the meta pods using k get pods -n rwc-2-mytenant -L risingwave/meta-role -l risingwave/component=meta --watch

Is the ETCD content odd?

What does ETCD say about who the leader is right now?

k port-forward svc/risingwave-etcd-mytenant 2379 &
etcdctl --endpoints=127.0.0.1:2379 get _ --prefix=true --write-out="fields"
"Key" : "__meta_election_/41b5869c9a0e221d"
"CreateRevision" : 2
"ModRevision" : 2
"Version" : 1
"Value" : "risingwave-meta-default-0.risingwave-meta:5690"
"Lease" : 4734838590407582237

"Key" : "__meta_election_/41b5869c9a0e2327"
"CreateRevision" : 108
"ModRevision" : 108
"Version" : 1
"Value" : "risingwave-meta-default-1.risingwave-meta:5690"
"Lease" : 4734838590407582503

"Key" : "__meta_election_/41b5869c9a0e2332"
"CreateRevision" : 111
"ModRevision" : 111
"Version" : 1
"Value" : "risingwave-meta-default-2.risingwave-meta:5690"
"Lease" : 4734838590407582514

If we look at the logs of our meta pods, we see meta_leader_lease_secs: 10, so "Lease" : 4734838590407582514 looks a bit odd to me, since interpreted as a timestamp it would be 97 years from now (see this timestamp converter). Do I misunderstand the meaning of lease? Is this not the time at which the value expires? Or am I looking at the wrong key here? Is it not __meta_election_?

ETCD incident and its effects on meta

Since our lease never expires, simply blocking the network between the leader pod and ETCD does not help. Let's simulate lease expiration by deleting the current leader's election key:

etcdctl --endpoints=127.0.0.1:2379 del __meta_election_/41b5869c9a0e221d
etcdctl --endpoints=127.0.0.1:2379 get _ --prefix=true --write-out="fields"
"Key" : "__meta_election_/41b5869c9a0e2327"
"CreateRevision" : 108
"ModRevision" : 108
"Version" : 1
"Value" : "risingwave-meta-default-1.risingwave-meta:5690"
"Lease" : 4734838590407582503

"Key" : "__meta_election_/41b5869c9a0e2332"
"CreateRevision" : 111
"ModRevision" : 111
"Version" : 1
"Value" : "risingwave-meta-default-2.risingwave-meta:5690"
"Lease" : 4734838590407582514

The former leader goes into an unknown state and meta-1 is elected as leader. meta-0 does not restart:

k get pods -n rwc-2-mytenant -L risingwave/meta-role -l risingwave/component=meta --watch
NAME                        READY   STATUS    RESTARTS   AGE     META-ROLE
risingwave-meta-default-0   1/1     Running   0          10m     leader
risingwave-meta-default-1   1/1     Running   0          8m43s   follower
risingwave-meta-default-2   1/1     Running   0          8m43s   follower
risingwave-meta-default-0   1/1     Running   0          54m     unknown
risingwave-meta-default-1   1/1     Running   0          52m     leader

# meta-1 is the actual leader according to the logs
k logs risingwave-meta-default-1 | grep "Defining leader services"
2023-03-01T09:57:16.595646Z  INFO risingwave_meta::rpc::server: Defining leader services

# original leader meta-0 stuck 
k logs risingwave-meta-default-0 | tail
2023-03-01T10:03:48.949202Z  INFO risingwave_meta::hummock::compactor_manager: Removed compactor session 2
2023-03-01T10:03:48.953430Z  INFO risingwave_meta::hummock::manager::worker: Released hummock context 2
2023-03-01T10:03:49.949167Z  WARN risingwave_meta::manager::cluster: Deleted expired worker 1 HostAddress {
    host: "10.244.0.20",
    port: 6660,
}, current timestamp 1677665029
2023-03-01T10:03:49.949202Z  INFO risingwave_meta::hummock::compactor_manager: Removed compactor session 1
2023-03-01T10:03:49.951419Z  INFO risingwave_meta::hummock::manager::worker: Released hummock context 1
2023-03-01T10:50:53.755999Z  WARN risingwave_meta::rpc::election_client: leader has been changed to risingwave-meta-default-1.risingwave-meta:5690
2023-03-01T10:50:53.756212Z  WARN risingwave_meta::rpc::election_client: client risingwave-meta-default-0.risingwave-meta:5690 lost leadership

Status after about 5 to 10 minutes:

etcdctl --endpoints=127.0.0.1:2379 get _ --prefix=true --write-out="fields"
"Key" : "__meta_election_/41b5869c9a0e2327"
"CreateRevision" : 108
"ModRevision" : 108
"Version" : 1
"Value" : "risingwave-meta-default-1.risingwave-meta:5690"
"Lease" : 4734838590407582503

"Key" : "__meta_election_/41b5869c9a0e2332"
"CreateRevision" : 111
"ModRevision" : 111
"Version" : 1
"Value" : "risingwave-meta-default-2.risingwave-meta:5690"
"Lease" : 4734838590407582514

k get pods -n rwc-2-mytenant -L risingwave/meta-role -l risingwave/component=meta
NAME                        READY   STATUS    RESTARTS   AGE   META-ROLE
risingwave-meta-default-0   1/1     Running   0          62m   unknown
risingwave-meta-default-1   1/1     Running   0          60m   leader
risingwave-meta-default-2   1/1     Running   0          60m   follower

k logs risingwave-meta-default-0 | tail -1
2023-03-01T10:50:53.756212Z  WARN risingwave_meta::rpc::election_client: client risingwave-meta-default-0.risingwave-meta:5690 lost leadership

Expected behaviour

I would have expected the following to happen:

If etcd has been cleared, the leader lease will expire and meta will lose leadership and exit proactively.

@CAJan93
Contributor Author

CAJan93 commented Mar 1, 2023

Okay, I did misunderstand the meaning of lease. The timeout of the lease seems to be correct:

etcdctl lease list
found 2 leases
41b5869cf076776f
41b5869cf0767775

etcdctl lease timetolive 41b5869cf076776f
lease 41b5869cf076776f granted with TTL(10s), remaining(9s)

etcdctl lease timetolive 41b5869cf0767775
lease 41b5869cf0767775 granted with TTL(10s), remaining(9s)

@shanicky
Contributor

You can directly use the command-line tool of the election module to check the leader. 😃

@shanicky
Contributor

Please retry the logic after #8734.

@CAJan93
Contributor Author

CAJan93 commented Mar 23, 2023

Tried it using your PR. Worked really well.

k get pods  -L risingwave/meta-role -l risingwave/component=meta --watch
NAME                        READY   STATUS    RESTARTS   AGE     META-ROLE
risingwave-meta-default-0   1/1     Running   0          3m43s   leader
risingwave-meta-default-1   1/1     Running   0          5m10s   follower
# delete lease here
risingwave-meta-default-0   0/1     Completed   0          5m4s    leader
risingwave-meta-default-0   0/1     Completed   0          5m4s    unknown
risingwave-meta-default-0   0/1     Running     1 (1s ago)   5m5s    unknown
risingwave-meta-default-1   1/1     Running     0            6m32s   leader
risingwave-meta-default-0   0/1     Running     1 (1s ago)   5m5s    follower
risingwave-meta-default-0   1/1     Running     1 (2s ago)   5m6s    follower

Let's try with the graceful shutdown.

@CAJan93 CAJan93 closed this as completed Mar 23, 2023