
feat(frontend): add heartbeat between frontend and compute #16014


Closed
chenzl25 wants to merge 11 commits

Conversation

@chenzl25 (Contributor) commented Mar 29, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@chenzl25 requested a review from BugenZhao March 29, 2024 03:58
github-actions bot added the type/feature (Type: New feature) label Mar 29, 2024
@lmatz requested a review from xuefengze March 29, 2024 13:58
@lmatz (Contributor) commented Mar 31, 2024

How about trying it once in Chaos Mesh? https://buildkite.com/risingwave-test/chaos-mesh
cc: @xuefengze could you list the environment variables needed to trigger #14030 and #14217 on Buildkite?

@xuefengze (Contributor) commented Apr 1, 2024

How about trying it once in Chaos Mesh? https://buildkite.com/risingwave-test/chaos-mesh
cc: @xuefengze could you list the environment variables needed to trigger #14030 and #14217 on Buildkite?

Maybe we can use scale-in to test it? After scaling in, if we don't manually unregister the worker, many SQL operations will fail, and RW will wait for recovery until the CN has expired.

@chenzl25 requested review from fuyufjh and removed request for liurenjie1024 April 2, 2024 08:28
check_heartbeat_interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
check_heartbeat_interval.reset();
loop {
Collaborator

Need to listen on shutdown_senders
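
For illustration, a minimal sketch of how the heartbeat loop could also listen for shutdown via tokio::select! (the oneshot channel and the check_compute_node_heartbeats helper are assumptions, not the actual PR code):

```rust
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::MissedTickBehavior;

async fn heartbeat_loop(mut shutdown_rx: oneshot::Receiver<()>) {
    let mut check_heartbeat_interval = tokio::time::interval(Duration::from_secs(1));
    check_heartbeat_interval.set_missed_tick_behavior(MissedTickBehavior::Skip);
    check_heartbeat_interval.reset();
    loop {
        tokio::select! {
            // Stop the background task once the frontend signals shutdown.
            _ = &mut shutdown_rx => break,
            // Otherwise run the periodic heartbeat check (hypothetical helper).
            _ = check_heartbeat_interval.tick() => {
                // check_compute_node_heartbeats().await;
            }
        }
    }
}
```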

Collaborator

Oops... The shutdown_senders and join_handles in FrontendEnv were not used... Let's fix it later 😅

Contributor Author

It seems the frontend node process, shutdown_senders, and those background tasks have the same lifecycle, so we don't actually need to handle it 🤔

@fuyufjh (Collaborator) left a comment

The rest LGTM

@yuhao-su (Contributor) commented Apr 4, 2024

Do we have the root cause of why compute nodes are sometimes unavailable?

@BugenZhao (Member) left a comment

Correct me if I'm wrong: this does not resolve the issue thoroughly but is rather a best-effort measure. The heartbeat cannot always be on time, so it's still possible to get an error when interacting with a stale client.

@lmatz (Contributor) commented Apr 5, 2024

I wonder if we can return some actionable error messages to the users, e.g. "retry after the heartbeat interval", so that when

The heartbeat cannot always be on time, so it's still possible to get an error when interacting with a stale client.

happens again, the users will retry on their own.

This depends on whether we can really detect and be sure that this is actually the case.

@BugenZhao (Member) commented Apr 5, 2024

happens again, the users will retry on their own.

Then how does this PR resolve #14569? Consider the case where all compute nodes in a cluster restart: masking compute nodes temporarily won't help if there's no refreshing mechanism for clients.

@chenzl25 (Contributor, Author) commented Apr 7, 2024

Do we have the root cause of why compute nodes are sometimes unavailable?

There are various situations. For example:

  1. A compute node pod is deleted but the worker is not yet unregistered from the meta node.
  2. A compute node OOMs.
  3. The network between the frontend and compute node is cut off.

@chenzl25 (Contributor, Author) commented Apr 7, 2024

Correct me if I'm wrong: this does not resolve the issue thoroughly but is rather a best-effort measure. The heartbeat cannot always be on time, so it's still possible to get an error when interacting with a stale client.

Your understanding is correct.

@chenzl25 (Contributor, Author) commented Apr 7, 2024

happens again, the users will retry on their own.

Then how does this PR resolve #14569? Consider the case where all compute nodes in a cluster restart: masking compute nodes temporarily won't help if there's no refreshing mechanism for clients.

If all compute nodes are masked by the frontend, then we have special logic to treat it as if no compute nodes were masked.
#14569 is actually a guess about why the frontend node can sometimes connect to the compute node and sometimes can't. But I realized that this guess is wrong, because we previously added a mask mechanism (#10328) which masks a compute node for a while if it is unreachable when executing a distributed query. So it is highly possible that the whole cluster is triggering recovery again and again, but the mask mechanism keeps the cluster available to serve batch queries.
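
For context, a rough sketch of the time-based masking idea described above (names and types are illustrative, not the actual RisingWave code): an unreachable worker is recorded with an expiry time and skipped for batch scheduling until the mask lapses.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical time-based mask: workers that failed a query are skipped
/// for scheduling until their mask expires.
struct WorkerMask {
    masked_until: HashMap<u32, Instant>, // worker id -> mask expiry
    mask_duration: Duration,
}

impl WorkerMask {
    /// Mask a worker after a failed RPC during a distributed query.
    fn mask(&mut self, worker_id: u32) {
        self.masked_until
            .insert(worker_id, Instant::now() + self.mask_duration);
    }

    /// A worker is schedulable again once its mask has expired.
    fn is_masked(&self, worker_id: u32) -> bool {
        self.masked_until
            .get(&worker_id)
            .map_or(false, |expiry| *expiry > Instant::now())
    }
}
```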

@fuyufjh (Collaborator) commented Apr 8, 2024

I doubt whether this addresses the issue that we wanted to solve.

In #14569, we observed perpetual stale connections, even though the CN restarts had happened hours ago. Actually, if the stale connections are not perpetual and only affect queries during/around recovery, we consider this expected behavior.

But this solution doesn't reconnect; rather, it masks the connection for a while, which just delays the problem for the aforementioned perpetual stale connections. Did I understand correctly?

@BugenZhao (Member)

But this solution doesn't reconnect

I think the reconnection could theoretically be handled by the gRPC library transparently, but it turns out it isn't: hyperium/tonic#1254.

@chenzl25 (Contributor, Author) commented Apr 8, 2024

I doubt whether this addresses the issue that we wanted to solve.

In #14569, we observed perpetual stale connections, even though the CN restarts had happened hours ago. Actually, if the stale connections are not perpetual and only affect queries during/around recovery, we consider this expected behavior.

But this solution doesn't reconnect; rather, it masks the connection for a while, which just delays the problem for the aforementioned perpetual stale connections. Did I understand correctly?

Let me add an invalidate logic to trigger a reconnection.
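
Roughly, the invalidation could look like the following sketch (a simplified illustration under assumed names, not the actual connection pool in this PR; ComputeClient stands in for the real gRPC client): removing the cached entry forces the next query to establish a fresh connection.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Stand-in for the real gRPC compute client.
struct ComputeClient;

/// Simplified client pool: dropping a cached entry forces the next caller
/// to establish a fresh connection instead of reusing a stale one.
struct ComputeClientPool {
    clients: Mutex<HashMap<String, ComputeClient>>, // address -> cached client
}

impl ComputeClientPool {
    /// Invalidate the cached client for an address, e.g. when its heartbeat fails.
    fn invalidate(&self, addr: &str) {
        self.clients.lock().unwrap().remove(addr);
    }
}
```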

@fuyufjh (Collaborator) commented Apr 8, 2024

I doubt whether this addresses the issue that we wanted to solve.
In #14569, we observed perpetual stale connections, even though the CN restarts had happened hours ago. Actually, if the stale connections are not perpetual and only affect queries during/around recovery, we consider this expected behavior.
But this solution doesn't reconnect; rather, it masks the connection for a while, which just delays the problem for the aforementioned perpetual stale connections. Did I understand correctly?

Let me add an invalidate logic to trigger a reconnection.

Saw the new changes.

Shall we remove the previous logic of mask_worker_node()? It is interfering with the invalidation at src/frontend/src/session.rs:L430-431. For the cases that I mentioned, the connection is expected to be alive after invalidating and reconnecting, so masking it will add unnecessary latency.


Additionally, as mentioned in #16014 (comment), this case should be handled by the connection pool (i.e., when seeing a broken connection, drop it and reconnect), but unfortunately it is not, so this approach is acceptable to me.

@chenzl25 requested review from fuyufjh and BugenZhao April 8, 2024 07:52
@BugenZhao (Member)

Let me add an invalidate logic to trigger a reconnection.

cc @yezizp2012 May I ask how this is achieved for the compute clients during recovery on the meta node?

@chenzl25 (Contributor, Author) commented Apr 8, 2024

Shall we remove the previous logic of mask_worker_node()?

mask_worker_node is used to handle cases such as killing a compute node pod that isn't unregistered from the meta node.

@yezizp2012 (Member)

Let me add an invalidate logic to trigger a reconnection.

cc @yezizp2012 May I ask how this is achieved for the compute clients during recovery on the meta node?

I think it relies on the reconnection mechanism powered by tonic. Since meta will always be communicating with the compute node, there should not be any stale connections, which is different from the situation in the frontend.

@chenzl25 (Contributor, Author) commented Apr 8, 2024

After a discussion with @fuyufjh, we both think after recovery, the meta node should notify the frontend node to reconstruct the connection pool between the frontend and the compute node.

@chenzl25 (Contributor, Author) commented Apr 9, 2024

After a discussion with @fuyufjh, we both think after recovery, the meta node should notify the frontend node to reconstruct the connection pool between the frontend and the compute node.

We'll use another PR to resolve #14569.

@BugenZhao (Member)

Let me add an invalidate logic to trigger a reconnection.

cc @yezizp2012 May I ask how this is achieved for the compute clients during recovery on the meta node?

I think it relies on the reconnection mechanism powered by tonic. Since meta will always be communicating with the compute node, there should not be any stale connections, which is different from the situation in the frontend.

But it looks like tonic does not reconnect itself. 😕 #16014 (comment)

@yezizp2012 (Member)

Let me add an invalidate logic to trigger a reconnection.

cc @yezizp2012 May I ask how this is achieved for the compute clients during recovery on the meta node?

I think it relies on the reconnection mechanism powered by tonic. Since meta will always be communicating with the compute node, there should not be any stale connections, which is different from the situation in the frontend.

But it looks like tonic does not reconnect itself. 😕 #16014 (comment)

😕 During the recovery process, the meta client is still used in the usual way through the connection pool. However, I found something that may be related: after v1.7, during recovery there were occasional issues where meta could not connect to a CN after it restarted in some of our clients' environments. After v1.7, the way meta communicates with CN was switched to streaming RPC, such as in the step of resetting compute nodes. This is similar to how queries from the frontend to CN are conducted via streaming RPC. Not sure if it's related, just FYI.

@BugenZhao (Member)

Now that #16215 got merged, will we still proceed on this PR?

@chenzl25 (Contributor, Author)

Now that #16215 got merged, will we still proceed on this PR?

I will change the purpose of this PR to refactoring the masking of worker nodes for batch queries. Previously we implemented this logic separately for distributed queries and local queries, which is a bit invasive. Using a heartbeat could achieve a similar purpose more elegantly.

Contributor

This PR has been open for 60 days with no activity. Could you please update the status? Feel free to ping a reviewer if you are waiting for review.

github-actions bot (Contributor) commented Jul 3, 2024

Closing this PR as no further action was taken after it was marked as stale for 7 days. Sorry! 🙏
You can reopen it when you have time to continue working on it.

github-actions bot closed this Jul 3, 2024
Successfully merging this pull request may close these issues.

bug: stale connection in the connection pool of frontend when CN restarts
7 participants