
feat(frontend): add heartbeat between frontend and compute #16014


Closed
chenzl25 wants to merge 11 commits

Conversation

@chenzl25 (Contributor) commented Mar 29, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@chenzl25 requested a review from BugenZhao March 29, 2024 03:58
github-actions bot added the type/feature (Type: New feature) label Mar 29, 2024
@lmatz requested a review from xuefengze March 29, 2024 13:58
@lmatz (Contributor) commented Mar 31, 2024

How about trying it once in Chaos Mesh? https://buildkite.com/risingwave-test/chaos-mesh
cc: @xuefengze could you list the environment variables needed to trigger #14030 and #14217 on Buildkite?

@xuefengze (Contributor) commented Apr 1, 2024

How about trying it once in Chaos Mesh? https://buildkite.com/risingwave-test/chaos-mesh
cc: @xuefengze could you list the environment variables needed to trigger #14030 and #14217 on Buildkite?

Maybe we can use scale-in to test it? After scaling in, if we don't manually unregister the worker, many SQL operations will fail, and RW will wait for recovery until the CN has expired.

@chenzl25 requested review from fuyufjh and removed request for liurenjie1024 April 2, 2024 08:28
check_heartbeat_interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
check_heartbeat_interval.reset();
loop {
Collaborator

Need to listen on shutdown_senders
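
For illustration, a minimal sketch of how the heartbeat loop could also listen for shutdown via tokio::select! (the oneshot channel and the check_compute_node_heartbeats helper are assumptions, not the actual PR code):

```rust
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::MissedTickBehavior;

async fn heartbeat_loop(mut shutdown_rx: oneshot::Receiver<()>) {
    let mut check_heartbeat_interval = tokio::time::interval(Duration::from_secs(1));
    check_heartbeat_interval.set_missed_tick_behavior(MissedTickBehavior::Skip);
    check_heartbeat_interval.reset();
    loop {
        tokio::select! {
            // Stop the background task once the frontend signals shutdown.
            _ = &mut shutdown_rx => break,
            // Otherwise run the periodic heartbeat check (hypothetical helper).
            _ = check_heartbeat_interval.tick() => {
                // check_compute_node_heartbeats().await;
            }
        }
    }
}
```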

Collaborator

Oops... The shutdown_senders and join_handles in FrontendEnv were not used... Let's fix it later 😅

Contributor Author

It seems the frontend node process, shutdown_senders, and those background tasks have the same lifecycle, so we don't actually need to handle it 🤔

@fuyufjh (Collaborator) left a comment

The rest LGTM

@yuhao-su (Contributor) commented Apr 4, 2024

Do we have the root cause of why compute nodes are sometimes unavailable?

@BugenZhao (Member) left a comment

Correct me if I'm wrong: this does not resolve the issue thoroughly but is rather a best-effort measure. The heartbeat cannot always be on time, so it's still possible to get an error when interacting with a stale client.

@lmatz (Contributor) commented Apr 5, 2024

I wonder if we can return some actionable error messages to the users, e.g. "retry after the heartbeat interval", so that when

The heartbeat cannot always be on time, so it's still possible to get an error when interacting with a stale client.

happens again, the users will retry on their own.

This depends on whether we can really detect and be sure that this is actually the case.

@BugenZhao (Member) commented Apr 5, 2024

happens again, the users will retry on their own.

Then how does this PR resolve #14569? Consider the case where all compute nodes in a cluster restart: masking compute nodes temporarily won't help if there's no refreshing mechanism for clients.

@chenzl25 (Contributor, Author) commented Apr 7, 2024

Do we have the root cause of why compute nodes are sometimes unavailable?

There are various situations. For example:

  1. A compute node pod is deleted but the worker is not yet unregistered from the meta node.
  2. A compute node OOMs.
  3. The network between the frontend and compute node is cut off.

@chenzl25 (Contributor, Author) commented Apr 7, 2024

Correct me if I'm wrong: this does not resolve the issue thoroughly but is rather a best-effort measure. The heartbeat cannot always be on time, so it's still possible to get an error when interacting with a stale client.

Your understanding is correct.

@chenzl25 (Contributor, Author) commented Apr 7, 2024

happens again, the users will retry on their own.

Then how does this PR resolve #14569? Consider the case where all compute nodes in a cluster restart: masking compute nodes temporarily won't help if there's no refreshing mechanism for clients.

If all compute nodes are masked by the frontend, then we have special logic to treat it as if no compute nodes were masked.
#14569 is actually a guess about why the frontend node can sometimes connect to the compute node and sometimes can't. But I realized that this guess is wrong, because we previously added a mask mechanism (#10328) which masks a compute node for a while if it is unreachable when executing a distributed query. So it is highly possible that the whole cluster is triggering recovery again and again, but the mask mechanism keeps the cluster available to serve batch queries.
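
For context, a rough sketch of the time-based masking idea described above (names and types are illustrative, not the actual RisingWave code): an unreachable worker is recorded with an expiry time and skipped for batch scheduling until the mask lapses.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical time-based mask: workers that failed a query are skipped
/// for scheduling until their mask expires.
struct WorkerMask {
    masked_until: HashMap<u32, Instant>, // worker id -> mask expiry
    mask_duration: Duration,
}

impl WorkerMask {
    /// Mask a worker after a failed RPC during a distributed query.
    fn mask(&mut self, worker_id: u32) {
        self.masked_until
            .insert(worker_id, Instant::now() + self.mask_duration);
    }

    /// A worker is schedulable again once its mask has expired.
    fn is_masked(&self, worker_id: u32) -> bool {
        self.masked_until
            .get(&worker_id)
            .map_or(false, |expiry| *expiry > Instant::now())
    }
}
```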

@fuyufjh (Collaborator) commented Apr 8, 2024

I doubt whether this addresses the issue that we wanted to solve.

In #14569, we observed perpetual stale connections, even though the CN restarts had happened hours ago. Actually, if the stale connections are not perpetual and only affect queries during/around recovery, we consider this expected behavior.

But this solution doesn't reconnect; rather, it masks the connection for a while, which just delays the problem for the aforementioned perpetual stale connections. Did I understand correctly?

@BugenZhao (Member)

But this solution doesn't reconnect

I think the reconnection could theoretically be handled by the gRPC library transparently, but it turns out it isn't: hyperium/tonic#1254.

@chenzl25 (Contributor, Author) commented Apr 8, 2024

I doubt whether this addresses the issue that we wanted to solve.

In #14569, we observed perpetual stale connections, even though the CN restarts had happened hours ago. Actually, if the stale connections are not perpetual and only affect queries during/around recovery, we consider this expected behavior.

But this solution doesn't reconnect; rather, it masks the connection for a while, which just delays the problem for the aforementioned perpetual stale connections. Did I understand correctly?

Let me add an invalidate logic to trigger a reconnection.
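
Roughly, the invalidation could look like the following sketch (a simplified illustration under assumed names, not the actual connection pool in this PR; ComputeClient stands in for the real gRPC client): removing the cached entry forces the next query to establish a fresh connection.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Stand-in for the real gRPC compute client.
struct ComputeClient;

/// Simplified client pool: dropping a cached entry forces the next caller
/// to establish a fresh connection instead of reusing a stale one.
struct ComputeClientPool {
    clients: Mutex<HashMap<String, ComputeClient>>, // address -> cached client
}

impl ComputeClientPool {
    /// Invalidate the cached client for an address, e.g. when its heartbeat fails.
    fn invalidate(&self, addr: &str) {
        self.clients.lock().unwrap().remove(addr);
    }
}
```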

@fuyufjh (Collaborator) commented Apr 8, 2024

I doubt whether this addresses the issue that we wanted to solve.
In #14569, we observed perpetual stale connections, even though the CN restarts had happened hours ago. Actually, if the stale connections are not perpetual and only affect queries during/around recovery, we consider this expected behavior.
But this solution doesn't reconnect; rather, it masks the connection for a while, which just delays the problem for the aforementioned perpetual stale connections. Did I understand correctly?

Let me add an invalidate logic to trigger a reconnection.

Saw the new changes.

Shall we remove the previous logic of mask_worker_node()? It is interfering with the invalidation at src/frontend/src/session.rs:L430-431. For the cases that I mentioned, the connection is expected to be alive after invalidating and reconnecting, so masking it will add unnecessary latency.


Additionally, as mentioned in #16014 (comment), this case should be handled by the connection pool (i.e., when seeing a broken connection, drop it and reconnect), but unfortunately it is not, so this approach is acceptable to me.

@chenzl25 requested review from fuyufjh and BugenZhao April 8, 2024 07:52
@BugenZhao (Member)

Let me add an invalidate logic to trigger a reconnection.

cc @yezizp2012 May I ask how this is achieved for the compute clients during recovery on the meta node?

@chenzl25 (Contributor, Author) commented Apr 8, 2024

Shall we remove the previous logic of mask_worker_node()?

mask_worker_node is used to handle cases such as killing a compute node pod that isn't unregistered from the meta node.

@yezizp2012 (Member)

Let me add an invalidate logic to trigger a reconnection.

cc @yezizp2012 May I ask how this is achieved for the compute clients during recovery on the meta node?

I think it relies on the reconnection mechanism powered by tonic. Since meta will always be communicating with the compute node, there should not be any stale connections, which is different from the situation in the frontend.

@chenzl25 (Contributor, Author) commented Apr 8, 2024

After a discussion with @fuyufjh, we both think after recovery, the meta node should notify the frontend node to reconstruct the connection pool between the frontend and the compute node.

@chenzl25 (Contributor, Author) commented Apr 9, 2024

After a discussion with @fuyufjh, we both think after recovery, the meta node should notify the frontend node to reconstruct the connection pool between the frontend and the compute node.

We'll use another PR to resolve #14569.

@BugenZhao (Member)

Let me add an invalidate logic to trigger a reconnection.

cc @yezizp2012 May I ask how this is achieved for the compute clients during recovery on the meta node?

I think it relies on the reconnection mechanism powered by tonic. Since meta will always be communicating with the compute node, there should not be any stale connections, which is different from the situation in the frontend.

But it looks like tonic does not reconnect itself. 😕 #16014 (comment)

@yezizp2012 (Member)

Let me add an invalidate logic to trigger a reconnection.

cc @yezizp2012 May I ask how this is achieved for the compute clients during recovery on the meta node?

I think it relies on the reconnection mechanism powered by tonic. Since meta will always be communicating with the compute node, there should not be any stale connections, which is different from the situation in the frontend.

But it looks like tonic does not reconnect itself. 😕 #16014 (comment)

😕 During the recovery process, the meta client is still used in the usual way through the connection pool. However, I found something that may be related: after v1.7, during recovery there were occasional issues where meta could not connect to a CN after it restarted in some of our clients' environments. After v1.7, the way meta communicates with CN was switched to streaming RPC, such as in the step of resetting compute nodes. This is similar to how queries from the frontend to CN are conducted via streaming RPC. Not sure if it's related, just FYI.

@BugenZhao (Member)

Now that #16215 got merged, will we still proceed on this PR?

@chenzl25 (Contributor, Author)

Now that #16215 got merged, will we still proceed on this PR?

I will change the purpose of this PR to refactoring the masking of worker nodes for batch queries. Previously we implemented this logic separately for distributed queries and local queries, which is a bit invasive. Using a heartbeat could achieve a similar purpose more elegantly.

Contributor

This PR has been open for 60 days with no activity. Could you please update the status? Feel free to ping a reviewer if you are waiting for review.

github-actions bot (Contributor) commented Jul 3, 2024

Closing this PR as no further action was taken after it was marked as stale for 7 days. Sorry! 🙏
You can reopen it when you have time to continue working on it.

github-actions bot closed this Jul 3, 2024
Successfully merging this pull request may close these issues.

bug: stale connection in the connection pool of frontend when CN restarts
7 participants