feat: allow drop streaming jobs during recovery #12203

Closed
yezizp2012 opened this issue Sep 11, 2023 · 14 comments · Fixed by #12317

@yezizp2012
Member

Is your feature request related to a problem? Please describe.

When there are problematic streaming jobs, a customer's cluster may experience continuous out-of-memory (OOM) issues on the compute nodes, resulting in unsuccessful recovery of the cluster. In this case, it should be possible to drop these jobs during the recovery process so that the cluster can recover successfully.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@yezizp2012 yezizp2012 added the type/feature Type: New feature. label Sep 11, 2023
@yezizp2012 yezizp2012 self-assigned this Sep 11, 2023
@github-actions github-actions bot added this to the release-1.3 milestone Sep 11, 2023
@yezizp2012
Member Author

yezizp2012 commented Sep 11, 2023

Cc @zwang28 .

@artorias1024

Cc @zwang28 .

I encountered the same failure. When can this issue be merged?

@yezizp2012
Member Author

I encountered the same failure. When can this issue be merged?

This should be done within this week.

@BugenZhao
Member

Since we've introduced pause_on_next_bootstrap in #11936, is this still a problem? I guess the recovery will succeed in no time without any actors running. Therefore, we can proceed to drop materialized views at that point.

@yezizp2012
Member Author

yezizp2012 commented Sep 15, 2023

Since we've introduced pause_on_next_bootstrap in #11936, is this still a problem? I guess the recovery will succeed in no time without any actors running. Therefore, we can proceed to drop materialized views at that point.

👍 Yes, pause_on_next_bootstrap definitely works for the OOM issue as well. But it might be a little complicated to operate, since it requires the user to:

  1. alter this parameter and restart the meta node.
  2. drop streaming jobs.
  3. resume the cluster by risectl or restart the meta node again.

Allowing drops during recovery may be more natural and does not impose additional learning costs on the user.

@yezizp2012
Member Author

Since we've introduced pause_on_next_bootstrap in #11936, is this still a problem? I guess the recovery will succeed in no time without any actors running. Therefore, we can proceed to drop materialized views at that point.

Or can we just introduce a concept of safe mode and decouple it from bootstrap? For example, introduce some SQL commands to switch the cluster to safe mode, whether it is recovering or not, and allow the user to run any DDL commands before switching it back to normal mode.
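
For illustration, such a decoupled safe mode might be driven by commands along these lines. The syntax below is purely hypothetical and does not exist in RisingWave; it only sketches the proposed user experience:

-- Hypothetical syntax, for illustration of the proposal only.
ALTER SYSTEM ENTER SAFE MODE;            -- pause all sources/DMLs, keep DDL available
DROP MATERIALIZED VIEW problematic_mv;   -- clean up the job that keeps causing OOM
ALTER SYSTEM EXIT SAFE MODE;             -- switch back to normal mode and resume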

@BugenZhao
Member

resulting in unsuccessful recovery of the cluster

Can you elaborate more on this? IIUC, the recovery is simply to build the actors and inject a barrier with kind Initial, while no data should be associated with this epoch. Only the following barriers will actually load data and lead to issues.

So to salvage this, we have to issue a DROP command right after the initial recovery barrier and before any following barriers, which sounds like a race against time to me. In my opinion, I'd prefer to gracefully pause the cluster and let users fix the situation in an unhurried way.

@BugenZhao
Member

BugenZhao commented Sep 19, 2023

For example, introduce some SQL commands to switch the cluster to safe mode

Actually, I don't see sufficient motivation for users to want to enter this "safe mode" when the cluster is running well. 😂 On the other hand, if the actors become stuck or problematic, there might be no way to notify them to pause without a restart or recovery.

I guess an improvement to the current behavior could be automatically entering the safe mode if the number of continuous recovery attempts reaches a threshold. BTW, I agree that we can add a SQL command to resume all data sources as a more user-friendly replacement of risectl.
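
As a rough sketch of the threshold idea (purely illustrative; the names RecoveryTracker, MAX_CONSECUTIVE_RECOVERIES, and PausedReason::TooManyRecoveries are hypothetical, not the actual RisingWave types):

// Illustrative sketch only: count consecutive failed recovery attempts and
// enter a "safe mode" (sources paused) once a threshold is reached.
const MAX_CONSECUTIVE_RECOVERIES: usize = 5;

/// Why data sources stay paused after recovery (simplified stand-in).
#[derive(Debug)]
enum PausedReason {
    TooManyRecoveries,
}

#[derive(Default)]
struct RecoveryTracker {
    consecutive_failures: usize,
}

impl RecoveryTracker {
    /// Called after each recovery attempt; returns a pause reason once the
    /// failure streak crosses the threshold.
    fn on_recovery_attempt(&mut self, succeeded: bool) -> Option<PausedReason> {
        if succeeded {
            self.consecutive_failures = 0;
            return None;
        }
        self.consecutive_failures += 1;
        (self.consecutive_failures >= MAX_CONSECUTIVE_RECOVERIES)
            .then_some(PausedReason::TooManyRecoveries)
    }
}

fn main() {
    let mut tracker = RecoveryTracker::default();
    for attempt in 1..=MAX_CONSECUTIVE_RECOVERIES {
        let paused = tracker.on_recovery_attempt(false);
        println!("recovery attempt {attempt} failed, pause on next recovery: {paused:?}");
    }
}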

@yezizp2012
Member Author

Can you elaborate more on this? IIUC, the recovery is simply to build the actors and inject a barrier with kind Initial, while no data should be associated with this epoch. Only the following barriers will actually load data and lead to issues.

Yes, that's exactly the issue found in the customer's environment.

So to salvage this, we have to issue a DROP command right after the initial recovery barrier and before any following barriers, which sounds like a race against time to me. In my opinion, I'd prefer to gracefully pause the cluster and let users fix the situation in an unhurried way.

👍 That's true, the DROP command will only be executed successfully during recovery or immediately after it. It is a race against time.

I guess an improvement to the current behavior could be automatically entering the safe mode if the number of continuous recovery attempts reaches a threshold. BTW, I agree that we can add a SQL command to resume all data sources as a more user-friendly replacement of risectl.

I really like the idea, but we'd better be careful when choosing the size of this threshold, or be smarter and automatically enter safe mode based on the reason the actor exits.

@yezizp2012
Member Author

I will close this issue and the associated PR; let's fix this kind of issue with pause_on_next_bootstrap. Let me go into a little more detail about the operational steps (a command sketch follows the list):

  1. alter system set pause_on_next_bootstrap to true;
  2. reboot the meta service, then the cluster will enter safe mode after recovery.
  3. drop the target streaming jobs.
  4. resume the cluster by risectl or restart the meta node again.
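
For reference, the steps above would look roughly like the following. The DROP target name is a placeholder, and the exact risectl invocation for resuming may differ between versions:

-- 1. In a SQL session, ask the cluster to pause on the next bootstrap:
ALTER SYSTEM SET pause_on_next_bootstrap TO true;

-- 2. Restart the meta service; the cluster enters "safe mode" (sources paused) after recovery.

-- 3. Drop the problematic streaming jobs, e.g.:
DROP MATERIALIZED VIEW problematic_mv;

-- 4. Resume the cluster, e.g. via risectl, or restart the meta node again:
--    risectl meta resume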

FYI, Cc @zwang28 @artorias1024 .

@zwang28
Contributor

zwang28 commented Oct 9, 2023

Please correct me if I'm wrong, but pause_on_next_bootstrap doesn't seem to work for the case where a temporal filter is involved.

I encountered a continuous recovery loop, and pause_on_next_bootstrap didn't help. The recovery was caused by a CN OOM, and source throughput was observed from an MV with a temporal filter. (I haven't figured out the cause of the OOM.)

@BugenZhao @yezizp2012

@BugenZhao
Member

Are you in the case where the executor even fails to handle the very first barrier? If so, then I guess pause_on_next_bootstrap won't help because it expects executors to at least correctly initialize themselves. 🥵

@zwang28
Contributor

zwang28 commented Oct 9, 2023

fails to handle the very first barrier

I saw the log "ignored syncing data for the first barrier" from the OOMed CN, so the first barrier was handled successfully.
Let me provide more info later to help the discussion (currently some metrics are unavailable).

@BugenZhao
Member

UPDATE:

So to salvage this, we have to issue a DROP command right after the initial recovery barrier and before any following barriers, which sounds like a race against time to me.

After an offline discussion with @yezizp2012, I find the statement above is not accurate. Once the Drop RPC is issued, it first cleans up the catalog before scheduling the command into the barrier manager. Even if the command is not executed successfully (i.e. graceful drop), the barrier manager will still clean up the fragments on the next recovery.

/// Recovery the whole cluster from the latest epoch.
///
/// If `paused_reason` is `Some`, all data sources (including connectors and DMLs) will be
/// immediately paused after recovery, until the user manually resume them either by restarting
/// the cluster or `risectl` command. Used for debugging purpose.
///
/// Returns the new state of the barrier manager after recovery.
pub async fn recovery(
    &self,
    prev_epoch: TracedEpoch,
    paused_reason: Option<PausedReason>,
) -> BarrierManagerState {
    // Mark blocked and abort buffered schedules, they might be dirty already.
    self.scheduled_barriers
        .abort_and_mark_blocked("cluster is under recovering")
        .await;

    tracing::info!("recovery start!");
    self.clean_dirty_tables()
        .await
        .expect("clean dirty tables should not fail");
    self.clean_dirty_fragments()
        .await
        .expect("clean dirty fragments");

That is to say, we don't have to "issue a DROP command right after the initial recovery barrier and before any following barriers". DROPs after following barriers already work now!

Actually, the missing piece is that if the initial recovery barrier cannot succeed, we'll enter a recovery loop. This might be caused by a bug when initializing an executor, or happen if data following the recovery barrier leads to a crash. In this case, we don't have a chance to drop the problematic streaming jobs, since DDL requests are rejected during recovery.

/// `run_command` spawns a tokio coroutine to execute the target ddl command. When the client
/// has been interrupted during executing, the request will be cancelled by tonic. Since we have
/// a lot of logic for revert, status management, notification and so on, ensuring consistency
/// would be a huge hassle and pain if we don't spawn here.
pub async fn run_command(&self, command: DdlCommand) -> MetaResult<NotificationVersion> {
    self.check_barrier_manager_status().await?;

Therefore, all we need to do is relax this restriction to complete that final piece. That's exactly #12317 from @yezizp2012. After it gets merged, the pause_on_next_bootstrap approach may no longer be necessary.
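
To make the idea concrete, here is a minimal, self-contained sketch of what relaxing the check could look like. It is an illustration of the approach only, not the actual change in #12317, and the types below are simplified stand-ins rather than the real risingwave_meta API:

// Illustrative sketch only: drop commands skip the "reject while recovering"
// check, so problematic streaming jobs can still be cleaned up in a recovery loop.
#[derive(Debug)]
enum DdlCommand {
    CreateStreamingJob(String),
    DropStreamingJob(String),
}

struct DdlService {
    recovering: bool,
}

impl DdlService {
    /// Stand-in for the real check: reject DDL while the cluster is recovering.
    fn check_barrier_manager_status(&self) -> Result<(), String> {
        if self.recovering {
            Err("the cluster is recovering, DDL is rejected".into())
        } else {
            Ok(())
        }
    }

    /// Relaxed entry point: only non-drop commands go through the recovery check.
    fn run_command(&self, command: DdlCommand) -> Result<(), String> {
        if !matches!(command, DdlCommand::DropStreamingJob(_)) {
            self.check_barrier_manager_status()?;
        }
        // ...spawn and execute the command as before...
        println!("executing {command:?}");
        Ok(())
    }
}

fn main() {
    let service = DdlService { recovering: true };
    // A CREATE is still rejected during recovery...
    assert!(service
        .run_command(DdlCommand::CreateStreamingJob("mv1".into()))
        .is_err());
    // ...but a DROP is now allowed through.
    assert!(service
        .run_command(DdlCommand::DropStreamingJob("mv1".into()))
        .is_ok());
}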
