feat: allow drop streaming jobs during recovery #12203

Closed
yezizp2012 opened this issue Sep 11, 2023 · 14 comments · Fixed by #12317

@yezizp2012
Member

Is your feature request related to a problem? Please describe.

When there are problematic streaming jobs, a customer's cluster may experience continuous out-of-memory (OOM) issues on the compute nodes, resulting in unsuccessful recovery of the cluster. In this case, it should be possible to drop these jobs during the recovery process so that the cluster can recover successfully.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@yezizp2012 yezizp2012 added the type/feature Type: New feature. label Sep 11, 2023
@yezizp2012 yezizp2012 self-assigned this Sep 11, 2023
@github-actions github-actions bot added this to the release-1.3 milestone Sep 11, 2023
@yezizp2012
Member Author

yezizp2012 commented Sep 11, 2023

Cc @zwang28 .

@artorias1024

Cc @zwang28 .

I encountered the same failure. When can this issue be merged?

@yezizp2012
Member Author

I encountered the same failure. When can this issue be merged?

This should be done within this week.

@BugenZhao
Member

Since we've introduced pause_on_next_bootstrap in #11936, is this still a problem? I guess the recovery will succeed in no time without any actors running. Therefore, we can proceed to drop materialized views at that point.

@yezizp2012
Member Author

yezizp2012 commented Sep 15, 2023

Since we've introduced pause_on_next_bootstrap in #11936, is this still a problem? I guess the recovery will succeed in no time without any actors running. Therefore, we can proceed to drop materialized views at that point.

👍 Yes, pause_on_next_bootstrap definitely works for the OOM issue as well. But it might be a little complicated to operate, since it requires the user to:

  1. alter this parameter and restart the meta node.
  2. drop streaming jobs.
  3. resume the cluster by risectl or restart the meta node again.

Allowing drops during recovery may be more natural and does not impose additional learning costs on the user.

@yezizp2012
Member Author

Since we've introduced pause_on_next_bootstrap in #11936, is this still a problem? I guess the recovery will succeed in no time without any actors running. Therefore, we can proceed to drop materialized views at that point.

Or can we just introduce a concept of safe mode and decouple it from bootstrap? For example, introduce some SQL commands to switch the cluster to safe mode, whether it is recovering or not, and allow the user to run any DDL commands before switching it back to normal mode.
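
For illustration, such a decoupled safe mode might be driven by commands along these lines. The syntax below is purely hypothetical and does not exist in RisingWave; it only sketches the proposed user experience:

-- Hypothetical syntax, for illustration of the proposal only.
ALTER SYSTEM ENTER SAFE MODE;            -- pause all sources/DMLs, keep DDL available
DROP MATERIALIZED VIEW problematic_mv;   -- clean up the job that keeps causing OOM
ALTER SYSTEM EXIT SAFE MODE;             -- switch back to normal mode and resume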

@BugenZhao
Member

resulting in unsuccessful recovery of the cluster

Can you elaborate more on this? IIUC, the recovery is simply to build the actors and inject a barrier with kind Initial, while no data should be associated with this epoch. Only the following barriers will actually load data and lead to issues.

So to salvage this, we have to issue a DROP command right after the initial recovery barrier and before any following barriers, which sounds like a race against time to me. In my opinion, I'd prefer to gracefully pause the cluster and let users fix the situation in an unhurried way.

@BugenZhao
Member

BugenZhao commented Sep 19, 2023

For example, introduce some SQL commands to switch the cluster to safe mode

Actually, I don't see sufficient motivation for users to want to enter this "safe mode" when the cluster is running well. 😂 On the other hand, if the actors become stuck or problematic, there might be no way to notify them to pause without a restart or recovery.

I guess an improvement to the current behavior could be automatically entering the safe mode if the number of continuous recovery attempts reaches a threshold. BTW, I agree that we can add a SQL command to resume all data sources as a more user-friendly replacement of risectl.
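
As a rough sketch of the threshold idea (purely illustrative; the names RecoveryTracker, MAX_CONSECUTIVE_RECOVERIES, and PausedReason::TooManyRecoveries are hypothetical, not the actual RisingWave types):

// Illustrative sketch only: count consecutive failed recovery attempts and
// enter a "safe mode" (sources paused) once a threshold is reached.
const MAX_CONSECUTIVE_RECOVERIES: usize = 5;

/// Why data sources stay paused after recovery (simplified stand-in).
#[derive(Debug)]
enum PausedReason {
    TooManyRecoveries,
}

#[derive(Default)]
struct RecoveryTracker {
    consecutive_failures: usize,
}

impl RecoveryTracker {
    /// Called after each recovery attempt; returns a pause reason once the
    /// failure streak crosses the threshold.
    fn on_recovery_attempt(&mut self, succeeded: bool) -> Option<PausedReason> {
        if succeeded {
            self.consecutive_failures = 0;
            return None;
        }
        self.consecutive_failures += 1;
        (self.consecutive_failures >= MAX_CONSECUTIVE_RECOVERIES)
            .then_some(PausedReason::TooManyRecoveries)
    }
}

fn main() {
    let mut tracker = RecoveryTracker::default();
    for attempt in 1..=MAX_CONSECUTIVE_RECOVERIES {
        let paused = tracker.on_recovery_attempt(false);
        println!("recovery attempt {attempt} failed, pause on next recovery: {paused:?}");
    }
}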

@yezizp2012
Member Author

Can you elaborate more on this? IIUC, the recovery is simply to build the actors and inject a barrier with kind Initial, while no data should be associated with this epoch. Only the following barriers will actually load data and lead to issues.

Yes, that's exactly the issue found in the customer's environment.

So to salvage this, we have to issue a DROP command right after the initial recovery barrier and before any following barriers, which sounds like a race against time to me. In my opinion, I'd prefer to gracefully pause the cluster and let users fix the situation in an unhurried way.

👍 That's true, the DROP command will only be executed successfully during recovery or immediately after it. It is a race against time.

I guess an improvement to the current behavior could be automatically entering the safe mode if the number of continuous recovery attempts reaches a threshold. BTW, I agree that we can add a SQL command to resume all data sources as a more user-friendly replacement of risectl.

I really like the idea, but we'd better be careful when choosing the size of this threshold, or be smarter and automatically enter safe mode based on the reason the actor exits.

@yezizp2012
Member Author

I will close this issue and the associated PR; let's fix this kind of issue with pause_on_next_bootstrap. Let me go into a little more detail about the operational steps (a command sketch follows the list):

  1. alter system set pause_on_next_bootstrap to true;
  2. reboot the meta service, then the cluster will enter safe mode after recovery.
  3. drop the target streaming jobs.
  4. resume the cluster by risectl or restart the meta node again.
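
For reference, the steps above would look roughly like the following. The DROP target name is a placeholder, and the exact risectl invocation for resuming may differ between versions:

-- 1. In a SQL session, ask the cluster to pause on the next bootstrap:
ALTER SYSTEM SET pause_on_next_bootstrap TO true;

-- 2. Restart the meta service; the cluster enters "safe mode" (sources paused) after recovery.

-- 3. Drop the problematic streaming jobs, e.g.:
DROP MATERIALIZED VIEW problematic_mv;

-- 4. Resume the cluster, e.g. via risectl, or restart the meta node again:
--    risectl meta resume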

FYI, Cc @zwang28 @artorias1024 .

@zwang28
Contributor

zwang28 commented Oct 9, 2023

Please correct me if I'm wrong, but pause_on_next_bootstrap doesn't seem to work for the case where a temporal filter is involved.

I encountered a continuous recovery loop, and pause_on_next_bootstrap didn't help. The recovery was caused by a CN OOM, and source throughput was observed from an MV with a temporal filter. (I haven't figured out the cause of the OOM.)

@BugenZhao @yezizp2012

@BugenZhao
Member

Are you in the case where the executor even fails to handle the very first barrier? If so, then I guess pause_on_next_bootstrap won't help because it expects executors to at least correctly initialize themselves. 🥵

@zwang28
Contributor

zwang28 commented Oct 9, 2023

fails to handle the very first barrier

I saw the log "ignored syncing data for the first barrier" from the OOMed CN, so the first barrier was handled successfully.
Let me provide more info later to help the discussion (currently some metrics are unavailable).

@BugenZhao
Member

UPDATE:

So to salvage this, we have to issue a DROP command right after the initial recovery barrier and before any following barriers, which sounds like a race against time to me.

After an offline discussion with @yezizp2012, I find the statement above is not accurate. Once the Drop RPC is issued, it first cleans up the catalog before scheduling the command into the barrier manager. Even if the command is not executed successfully (i.e. graceful drop), the barrier manager will still clean up the fragments on the next recovery.

/// Recovery the whole cluster from the latest epoch.
///
/// If `paused_reason` is `Some`, all data sources (including connectors and DMLs) will be
/// immediately paused after recovery, until the user manually resume them either by restarting
/// the cluster or `risectl` command. Used for debugging purpose.
///
/// Returns the new state of the barrier manager after recovery.
pub async fn recovery(
    &self,
    prev_epoch: TracedEpoch,
    paused_reason: Option<PausedReason>,
) -> BarrierManagerState {
    // Mark blocked and abort buffered schedules, they might be dirty already.
    self.scheduled_barriers
        .abort_and_mark_blocked("cluster is under recovering")
        .await;

    tracing::info!("recovery start!");
    self.clean_dirty_tables()
        .await
        .expect("clean dirty tables should not fail");
    self.clean_dirty_fragments()
        .await
        .expect("clean dirty fragments");

That is to say, we don't have to "issue a DROP command right after the initial recovery barrier and before any following barriers". DROPs after following barriers already work now!

Actually, the missing piece is that if the initial recovery barrier cannot succeed, we'll enter a recovery loop. This might be caused by a bug when initializing an executor, or happen if data following the recovery barrier leads to a crash. In this case, we don't have a chance to drop the problematic streaming jobs, since DDL requests are rejected during recovery.

/// `run_command` spawns a tokio coroutine to execute the target ddl command. When the client
/// has been interrupted during executing, the request will be cancelled by tonic. Since we have
/// a lot of logic for revert, status management, notification and so on, ensuring consistency
/// would be a huge hassle and pain if we don't spawn here.
pub async fn run_command(&self, command: DdlCommand) -> MetaResult<NotificationVersion> {
    self.check_barrier_manager_status().await?;

Therefore, all we need to do is relax this restriction to complete that final piece. That's exactly #12317 from @yezizp2012. After it gets merged, the pause_on_next_bootstrap approach may no longer be necessary.
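
To make the idea concrete, here is a minimal, self-contained sketch of what relaxing the check could look like. It is an illustration of the approach only, not the actual change in #12317, and the types below are simplified stand-ins rather than the real risingwave_meta API:

// Illustrative sketch only: drop commands skip the "reject while recovering"
// check, so problematic streaming jobs can still be cleaned up in a recovery loop.
#[derive(Debug)]
enum DdlCommand {
    CreateStreamingJob(String),
    DropStreamingJob(String),
}

struct DdlService {
    recovering: bool,
}

impl DdlService {
    /// Stand-in for the real check: reject DDL while the cluster is recovering.
    fn check_barrier_manager_status(&self) -> Result<(), String> {
        if self.recovering {
            Err("the cluster is recovering, DDL is rejected".into())
        } else {
            Ok(())
        }
    }

    /// Relaxed entry point: only non-drop commands go through the recovery check.
    fn run_command(&self, command: DdlCommand) -> Result<(), String> {
        if !matches!(command, DdlCommand::DropStreamingJob(_)) {
            self.check_barrier_manager_status()?;
        }
        // ...spawn and execute the command as before...
        println!("executing {command:?}");
        Ok(())
    }
}

fn main() {
    let service = DdlService { recovering: true };
    // A CREATE is still rejected during recovery...
    assert!(service
        .run_command(DdlCommand::CreateStreamingJob("mv1".into()))
        .is_err());
    // ...but a DROP is now allowed through.
    assert!(service
        .run_command(DdlCommand::DropStreamingJob("mv1".into()))
        .is_ok());
}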
