feat(frontend): add heartbeat between frontend and compute #16014
Conversation
How about trying it once in Chaos Mesh? https://buildkite.com/risingwave-test/chaos-mesh

```rust
check_heartbeat_interval
    .set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
check_heartbeat_interval.reset();
loop {
```
Need to listen on shutdown_senders
Oops... The shutdown_senders and join_handers in FrontendEnv were not used... Let's fix it later 😅
It seems the frontend node process, shutdown_senders, and those background tasks have the same lifecycle, so we don't actually need to handle it 🤔
The rest LGTM
Do we know the root cause of why compute nodes are sometimes unavailable?
Correct me if I'm wrong: this does not resolve the issue thoroughly but is rather a best-effort mitigation. The heartbeat cannot always be on time, so it's still possible to get an error when interacting with a stale client.
I wonder if we can return some actionable error messages back to the users, so that if it happens again, the users will retry on their own. It depends on whether we can really detect and be sure this is really the case.
Then how does this PR resolve #14569? Consider the case where all compute nodes in a cluster restart: temporarily masking compute nodes won't help if there's no refresh mechanism for clients.
There are various situations. For example:
Your understanding is correct.
If all compute nodes are masked by the frontend, then we have special logic to treat it as if no compute nodes were masked.
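That all-masked fallback boils down to a pure filtering function. A hypothetical sketch, assuming masking by worker id (`WorkerId` and `candidate_workers` are illustrative names, not the actual code):

```rust
use std::collections::HashSet;

// Hypothetical worker id type; the real code tracks worker node structs.
type WorkerId = u32;

/// Return the workers eligible for batch query scheduling.
/// If masking would leave no candidates, fall back to the full list,
/// mirroring the "treat all-masked as none-masked" special case.
fn candidate_workers(all: &[WorkerId], masked: &HashSet<WorkerId>) -> Vec<WorkerId> {
    let unmasked: Vec<WorkerId> =
        all.iter().copied().filter(|w| !masked.contains(w)).collect();
    if unmasked.is_empty() {
        all.to_vec() // every node masked: better to try them all than fail outright
    } else {
        unmasked
    }
}

fn main() {
    let all = vec![1, 2, 3];

    let some_masked: HashSet<WorkerId> = [2].into_iter().collect();
    assert_eq!(candidate_workers(&all, &some_masked), vec![1, 3]);

    // All masked: fall back to the full worker list instead of returning nothing.
    let all_masked: HashSet<WorkerId> = [1, 2, 3].into_iter().collect();
    assert_eq!(candidate_workers(&all, &all_masked), vec![1, 2, 3]);
    println!("ok");
}
```

The fallback keeps queries schedulable during a full-cluster restart, at the cost of possibly hitting a node that is still down.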
I suspect this may not address the issue we wanted to solve. In #14569, we observed perpetually stale connections, even though the CN restarts had happened hours earlier. If the stale connections are not perpetual and only affect queries during/around recovery, we consider that expected behavior. But this solution does not reconnect; it masks the connection for a while, which only delays the problem for those perpetually stale connections. Did I understand correctly?
I think theoretically the reconnection could be handled by the gRPC library transparently, but it turns out it is not: hyperium/tonic#1254.
Let me add invalidation logic to trigger a reconnection.
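Invalidation here means dropping the cached entry from the connection pool so the next lookup dials a fresh connection instead of reusing a stale one. A toy sketch of that idea; every name below (`Pool`, `Conn`, `invalidate`) is hypothetical, not the actual pool implementation, and the `generation` counter merely stands in for dialing a real gRPC channel:

```rust
use std::collections::HashMap;

// Stand-in for a gRPC channel; the real pool would hold tonic channels.
#[derive(Clone)]
struct Conn {
    generation: u64,
}

/// A toy connection pool: `invalidate` removes the cached entry so the
/// next `get` establishes a fresh connection rather than reusing a stale one.
struct Pool {
    conns: HashMap<String, Conn>,
    next_gen: u64,
}

impl Pool {
    fn new() -> Self {
        Pool { conns: HashMap::new(), next_gen: 0 }
    }

    fn get(&mut self, addr: &str) -> Conn {
        if let Some(c) = self.conns.get(addr) {
            return c.clone(); // reuse the pooled connection
        }
        // "Connect": in reality this would dial the compute node.
        self.next_gen += 1;
        let c = Conn { generation: self.next_gen };
        self.conns.insert(addr.to_string(), c.clone());
        c
    }

    fn invalidate(&mut self, addr: &str) {
        self.conns.remove(addr); // force a reconnect on next use
    }
}

fn main() {
    let mut pool = Pool::new();
    let first = pool.get("cn-1:5688");
    assert_eq!(pool.get("cn-1:5688").generation, first.generation); // reused

    pool.invalidate("cn-1:5688"); // e.g. a heartbeat failed / connection broke
    let second = pool.get("cn-1:5688");
    assert!(second.generation > first.generation); // fresh connection
    println!("ok");
}
```

Calling `invalidate` on heartbeat failure is what turns the mask-for-a-while behavior into an actual reconnect.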
Saw the new changes. Shall we remove the previous logic? Additionally, as mentioned in #16014 (comment), the case should be handled by the connection pool (i.e., when seeing a broken connection, drop the connection and reconnect), but unfortunately it isn't, so this approach is acceptable to me.
cc @yezizp2012 May I ask how this is achieved for the compute clients during recovery on the meta node?
mask_worker_node is used to handle cases like a compute node pod being killed without being unregistered from the meta.
I think it relies on the reconnection mechanism powered by tonic. Since meta is always communicating with the compute node, there should not be any stale connections, which is different from the situation in the frontend.
After a discussion with @fuyufjh, we both think that after recovery, the meta node should notify the frontend node to reconstruct the connection pool between the frontend and the compute nodes.
But it looks like tonic does not reconnect by itself. 😕 #16014 (comment)
😕 During the recovery process, the meta client is still used in the usual way through the connection pool. However, I found something that may be related: after v1.7, during recovery there were occasional issues where meta could not connect to a CN after it restarted in some of our clients' environments. After v1.7, the way meta communicates with CN was switched to stream RPC, such as in the step of resetting compute nodes. This is similar to how queries from the frontend to CN are conducted via stream RPC. Not sure if it's related, just FYI.
Now that #16215 has been merged, will we still proceed with this PR?
I will change the purpose of this PR to refactoring how worker nodes are masked for batch queries. Previously we implemented this logic separately for distributed queries and for local queries, which is a bit invasive. Using a heartbeat achieves a similar purpose more elegantly.
This PR has been open for 60 days with no activity. Could you please update the status? Feel free to ping a reviewer if you are waiting for review.
Closing this PR as no further action was taken after it was marked as stale for 7 days. Sorry! 🙏
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Checklist
- ./risedev check (or alias, ./risedev c)

Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.