feat: allow drop streaming jobs during recovery #12203
Comments
Cc @zwang28.
I encountered the same failure. When can this be merged?
This should be done this week.
Since we've introduced
👍 Yes,
Allowing drops during recovery may be more natural and would not require additional learning cost for the user.
Or can we just introduce a concept of safe mode and decouple it from bootstrap? For example, introduce SQL commands that switch the cluster into safe mode, whether or not it is recovering, and allow the user to run any DDL commands before switching it back to normal mode.
Can you elaborate more on this? IIUC, the recovery simply rebuilds the actors and injects a barrier with kind … So to solve this, we have to issue a
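For context, here is a minimal Rust sketch of that recovery flow as I read it from this thread. All names are hypothetical and do not mirror the actual meta-node code; it only illustrates "rebuild the actors, then inject the first barrier, and retry the whole attempt if that barrier fails".

```rust
// Hypothetical sketch of the recovery flow discussed above; function and type
// names are made up for illustration only.

struct ActorInfo {
    actor_ids: Vec<u32>,
}

fn rebuild_actors(info: &ActorInfo) -> Result<(), String> {
    // Placeholder: re-create the streaming actors on the compute nodes.
    println!("rebuilding {} actors", info.actor_ids.len());
    Ok(())
}

fn inject_recovery_barrier(info: &ActorInfo) -> Result<(), String> {
    // Placeholder: inject the first barrier after recovery; if any actor fails
    // to handle it, the whole recovery attempt is retried.
    println!("injecting recovery barrier to {} actors", info.actor_ids.len());
    Ok(())
}

fn recover(info: &ActorInfo) -> Result<(), String> {
    rebuild_actors(info)?;
    inject_recovery_barrier(info)
}

fn main() {
    let info = ActorInfo { actor_ids: vec![1, 2, 3] };
    recover(&info).expect("recovery failed");
}
```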
Actually, I didn't find sufficient motivation for users to want to enter this "safe mode" when the cluster is running well. 😂 On the other hand, if the actors become stuck or problematic, there might be no way to notify them to pause without a restart or recovery. I guess an improvement to the current behavior could be to automatically enter safe mode once the number of consecutive recovery attempts reaches a threshold. BTW, I agree that we can add a SQL command to
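To make the threshold idea concrete, here is a small, self-contained Rust sketch. The names and the threshold value are assumptions for illustration, not RisingWave's actual implementation: after a configurable number of consecutive failed recovery attempts, the cluster flips into safe mode, and a successful recovery resets the counter.

```rust
// Hypothetical sketch: automatically enter "safe mode" after too many
// consecutive failed recovery attempts. Names and threshold are illustrative.

#[derive(Default)]
struct RecoveryTracker {
    consecutive_failures: u32,
    safe_mode: bool,
}

impl RecoveryTracker {
    // Assumed threshold; in practice this would need careful tuning.
    const SAFE_MODE_THRESHOLD: u32 = 10;

    fn on_recovery_result(&mut self, succeeded: bool) {
        if succeeded {
            // A successful recovery resets the counter and leaves safe mode.
            self.consecutive_failures = 0;
            self.safe_mode = false;
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= Self::SAFE_MODE_THRESHOLD {
                // Stop retrying blindly; only accept commands such as DROP
                // until an operator intervenes.
                self.safe_mode = true;
            }
        }
    }
}

fn main() {
    let mut tracker = RecoveryTracker::default();
    for _ in 0..10 {
        tracker.on_recovery_result(false);
    }
    assert!(tracker.safe_mode);
}
```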
Yes, that's exactly the issue found in the client's environment.
👍 That's true, the
Really like the idea, but we'd better be careful when setting this threshold, or be smarter and automatically enter safe mode based on the reason the actor exits.
I will close this issue and the associated PR; let's fix this kind of issue with
FYI, Cc @zwang28 @artorias1024.
Please correct me if I'm wrong: I encountered a continuous recovery loop and
Are you in the case where the executor fails to handle even the very first barrier? If so, then I guess
I saw the log "ignored syncing data for the first barrier" from the OOMed CN, so the first barrier was handled successfully.
UPDATE:
After an offline discussion with @yezizp2012, I find that the statement above is not accurate. Once the
risingwave/src/meta/src/barrier/recovery.rs Lines 212 to 236 in cca63a0
That is to say, we don't have to "issue a …".

Actually, the missing piece is that if the initial recovery barrier cannot succeed, we'll enter a recovery loop. That might be caused by a bug when initializing an executor, or it can happen if any data following the recovery barrier leads to a crash. In this case, we don't have a chance to drop the problematic streaming jobs, since DDL requests are rejected during recovery.

risingwave/src/meta/src/rpc/ddl_controller.rs Lines 224 to 229 in cca63a0
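The guard referenced above roughly amounts to the following kind of check. This is a hypothetical sketch, not the actual `ddl_controller.rs` code; the status enum and error message are assumptions: any DDL request is rejected while the barrier manager has not reached the running state.

```rust
// Hypothetical sketch of the "reject DDL during recovery" guard; names and
// error message are illustrative only.

#[derive(Debug, PartialEq)]
enum BarrierManagerStatus {
    Starting,
    Recovering,
    Running,
}

fn check_barrier_manager_status(status: &BarrierManagerStatus) -> Result<(), String> {
    if *status != BarrierManagerStatus::Running {
        // All DDL requests, including DROP, are bounced here, which is why a
        // cluster stuck in a recovery loop cannot get rid of problematic jobs.
        return Err("the cluster is starting or recovering".to_string());
    }
    Ok(())
}

fn main() {
    assert!(check_barrier_manager_status(&BarrierManagerStatus::Recovering).is_err());
    assert!(check_barrier_manager_status(&BarrierManagerStatus::Running).is_ok());
}
```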
Therefore, all we need to do is relax this restriction to complete that final piece. That's exactly #12317 from @yezizp2012. After it gets merged, the
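Conceptually, the relaxation can be thought of as exempting drop commands from that guard. Below is a hypothetical Rust sketch of that direction (the enum variants and function are made up for illustration, not the PR's actual code):

```rust
// Hypothetical sketch of relaxing the guard so that drop commands are allowed
// while the cluster is still recovering; variants are illustrative only.

#[derive(Debug)]
enum DdlCommand {
    CreateStreamingJob,
    DropStreamingJob,
}

fn allowed_during_recovery(cmd: &DdlCommand) -> bool {
    // Only commands that shrink the stream graph (drops) are permitted while
    // recovery is in progress; creations still wait for a healthy cluster.
    matches!(cmd, DdlCommand::DropStreamingJob)
}

fn main() {
    assert!(allowed_during_recovery(&DdlCommand::DropStreamingJob));
    assert!(!allowed_during_recovery(&DdlCommand::CreateStreamingJob));
}
```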
Is your feature request related to a problem? Please describe.
When there are problematic streaming jobs, the client cluster may experience continuous out-of-memory (OOM) issues on the CN nodes, preventing the cluster from recovering successfully. In this case, it should be possible to drop these jobs during the recovery process so that the cluster can recover.
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response