
Update Orchestrator to Latest Changes and Prepare for Future L3 Support #524


Merged
merged 21 commits into main from rebase/orchestrator-272b013a
Apr 16, 2025

Conversation

heemankv
Contributor

@heemankv heemankv commented Mar 8, 2025

Description

This PR brings the orchestrator in this repo up to date with its latest changes. Moving forward, all orchestrator-related PRs will be raised directly in this repo, allowing us to archive the madara-orch repo.

Additionally, this update makes the orchestrator more production-ready by resolving a memory leak by switching to the jemalloc allocator. We've also refactored the logic to better support future L3 integrations.

PR Type

  • Feature

Other Information

This update ensures a more stable and scalable orchestrator, setting the foundation for future improvements.

@heemankv heemankv self-assigned this Mar 8, 2025
@heemankv heemankv force-pushed the rebase/orchestrator-272b013a branch from 2464970 to 00bc605 Compare March 8, 2025 07:07
@Mohiiit Mohiiit changed the title update: orchestrator merge Update Orchestrator to Latest Changes and Prepare for Future L3 Support Mar 12, 2025
@Trantorian1 Trantorian1 moved this to In review in Madara Mar 17, 2025
@Trantorian1 Trantorian1 added feature Request for new feature or enhancement Project Orch labels Mar 17, 2025
@notlesh
Contributor

notlesh commented Mar 17, 2025

This PR brings the orchestrator in this repo up to date with its latest changes.

For clarity, does this mean that it picks up from orchestrator PRs that were abandoned (because they were done while the monorepo was in progress)?

If so, it would be good to link to the relevant PRs.

@Mohiiit
Member

Mohiiit commented Mar 19, 2025

Yes, this includes all the changes that were made to the orchestrator while the monorepo was being merged. Links to the merged PRs:

  1. Feat: Retry endpoint added madara-orchestrator#198
  2. refactor: job isolation done madara-orchestrator#204
  3. Memory Leak Fix: Implementing jemalloc Allocator madara-orchestrator#205
  4. Changes based on Griddy Chain madara-orchestrator#206
  5. Fix/improvements madara-orchestrator#207

cc: @notlesh

@Mohiiit Mohiiit requested a review from notlesh March 19, 2025 05:47
Contributor

@raynaudoe raynaudoe left a comment

uACK, left a few comments

}

pub async fn get_active_jobs(&self) -> HashSet<Uuid> {
self.active_jobs.lock().await.clone()
Contributor

nit: a read only lock might be a little more efficient

Contributor Author

@heemankv heemankv Mar 25, 2025

I'd like to hear more on this. Can you provide more context? We can also hop on a call if you'd like.

Contributor

I was trying to understand the use case (to see if we really want only one reader), but it seems this function is never called.

Contributor Author

Yeah, you're right. This was used in a more complex implementation earlier; I guess we missed removing it. Removing it now.
Great catch!

Ok(Ok(permit)) => {
{
let mut active_jobs = self.active_jobs.lock().await;
active_jobs.insert(job.id);
Contributor

Maybe check here if active_jobs already has job.id

Contributor Author

Makes sense. Let's discuss this together with your idea of a read-only lock.
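For context on the duplicate check discussed above: `HashSet::insert` already reports whether the value was newly added, so the check and the insert can be a single call. A minimal sketch, not the PR's code (job ids are plain u64 here instead of Uuid to stay dependency-free, and the helper name is hypothetical):

```rust
use std::collections::HashSet;

// Hypothetical helper, not from the PR: HashSet::insert returns false when
// the value is already present, so the duplicate check comes for free.
fn try_track_job(active_jobs: &mut HashSet<u64>, job_id: u64) -> bool {
    if active_jobs.insert(job_id) {
        true // newly tracked
    } else {
        eprintln!("job {job_id} is already being processed, skipping");
        false
    }
}

fn main() {
    let mut active = HashSet::new();
    assert!(try_track_job(&mut active, 42));  // first insert succeeds
    assert!(!try_track_job(&mut active, 42)); // duplicate is rejected
}
```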

log::warn!("DA job for the first block is not yet completed. Returning safely...");
return Ok(());
}
}
}

let mut blocks_to_process: Vec<u64> = find_successive_blocks_in_vector(blocks_to_process);

let mut blocks_to_process = find_successive_blocks_in_vector(blocks_to_process);
Contributor

What if this vector cannot find any successive blocks or there are dangling blocks that don't have their corresponding neighbours? Those blocks would eventually get processed, right?
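For readers following this thread, here is a hypothetical sketch of the behavior being discussed (the PR's actual implementation of find_successive_blocks_in_vector is not shown in this excerpt): keeping only the leading run of consecutive blocks means a block after a gap simply waits until the missing neighbours arrive in a later run.

```rust
// Hypothetical sketch, not the PR's implementation: keep only the leading
// run of consecutive block numbers; blocks after a gap are left for a
// future run once the missing neighbours arrive.
fn find_successive_blocks_in_vector(blocks: Vec<u64>) -> Vec<u64> {
    let mut out: Vec<u64> = Vec::new();
    for b in blocks {
        match out.last() {
            None => out.push(b),
            Some(&prev) if b == prev + 1 => out.push(b),
            _ => break, // first gap found: stop, later blocks wait
        }
    }
    out
}

fn main() {
    // 8 and 9 are "dangling" until 7 shows up; they are not processed yet.
    assert_eq!(find_successive_blocks_in_vector(vec![4, 5, 6, 8, 9]), vec![4, 5, 6]);
}
```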

@raynaudoe
Contributor

@heemankv can you please merge main so the branch has the latest CI fixes?

async fn update_state_for_block(
&self,
config: Arc<Config>,
block_no: u64,
snos: StarknetOsOutput,
nonce: u64,
program_output: Vec<[u8; 32]>,
Contributor

Why not use a Felt as the type here?

Contributor Author

@heemankv heemankv Apr 8, 2025

This is because, as of now, this piece of code serves both Ethereum and Starknet;
IMO it would be very type-intensive to change it just for Starknet.
Let me see what I can do.

Contributor Author

@heemankv heemankv Apr 11, 2025

Is there any specific advantage to using the Felt type here (which we can only use for Starknet), or is it just to ensure type consistency?
If it's for type consistency, I think we can create an issue and handle it later, since it isn't part of this PR's scope and would make a really good first issue.
Wdyt?

Contributor

If it's used for multiple types, then Felt doesn't make sense. Seems fine for now, a Good First Issue for someone with some inspiration would be good :)

let mut job_item = build_job_item(JobType::DataSubmission, JobStatus::PendingVerification, 1);

// Set process_attempt_no to 1 to simulate max attempts reached
job_item.metadata.common.process_attempt_no = 1;
Contributor

Is there a constant or some value you can query instead of hard coding this to 1?

Contributor

...Or one you could define here in the code

Contributor Author

I think this is just for testing purposes;
we put in 1 here to hit the limit defined here:

fn max_process_attempts(&self) -> u64 {
1
}

Comment on lines -742 to +772
// create a job
// Create a job
Contributor

@notlesh notlesh Apr 4, 2025

These kind of changes create a lot of noise (and massive headaches / merge conflicts when merging code), are you doing this manually or does your editor do it for you? 😅

Contributor Author

Haha, I understand. It's the editor doing it by itself; IMO the change itself is correct, and I don't see any harm in it.
Let me know if there are many like these.

Contributor

Yeah, the entire PR was littered with changes like that.

Contributor Author

Alright, understood.
We'll make a note of it for sure.
Do you believe these changes should be reverted?

Contributor

It's probably fine unless others feel strongly about it. If you try to remove it, git's -p switch can be useful (e.g. git checkout -p or git add -p when preparing a commit).


// Update job_item to use the proper metadata structure for SNOS jobs
job_item.metadata.specific = JobSpecificMetadata::Snos(SnosMetadata {
block_number: 0,
Contributor

All of these tests so far have used block 0, which is a very special case -- it might be good to test with other cases

Contributor Author

IMO that shouldn't matter for these tests, since they just verify against the mock DB server whether the job can interact with it.

specific: JobSpecificMetadata::StateUpdate(StateUpdateMetadata {
blocks_to_settle: vec![6, 7, 8], // Gap between 4 and 6
snos_output_paths: vec![
format!("{}/{}", 6, SNOS_OUTPUT_FILE_NAME),
Contributor

@notlesh notlesh Apr 4, 2025

This pattern ("{}/{}") is repeated a lot in this PR, it could be really helpful for readability/documentation/DRY to create a util fn for it

Contributor Author

Created an issue for this here : #568
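A sketch of the util fn suggested above (the name block_file_path and the constant's value are illustrative, not from the PR): it centralizes the "{block_number}/{file_name}" storage-path convention in one place.

```rust
// Hypothetical util fn: one place for the "{block_number}/{file_name}"
// storage-path convention repeated throughout the PR.
const SNOS_OUTPUT_FILE_NAME: &str = "snos_output.json"; // illustrative value

fn block_file_path(block_number: u64, file_name: &str) -> String {
    format!("{block_number}/{file_name}")
}

fn main() {
    assert_eq!(block_file_path(6, SNOS_OUTPUT_FILE_NAME), "6/snos_output.json");
}
```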

tx_hash: None,
}),
},
_ => panic!("Invalid job type"),
Contributor

I'd leave this out so the compiler can help

Contributor Author

@heemankv heemankv Apr 8, 2025

Since ProofRegistration has been added but is not yet handled, we have to keep this arm. It will be removed after L3 support for the orchestrator gets merged.
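For illustration, the reviewer's point about letting the compiler help can be sketched like this (variant names mirror the PR, but the payloads and queue names are invented): without a wildcard arm, adding a new JobType variant becomes a compile error at every match site instead of a runtime panic.

```rust
// Sketch only: queue names are invented for illustration.
#[derive(Debug)]
enum JobType {
    SnosRun,
    DataSubmission,
    StateTransition,
}

fn queue_name(job_type: &JobType) -> &'static str {
    // No `_` arm: if a variant like ProofRegistration were added later,
    // this match would stop compiling until it is handled explicitly.
    match job_type {
        JobType::SnosRun => "snos_queue",
        JobType::DataSubmission => "da_queue",
        JobType::StateTransition => "state_update_queue",
    }
}

fn main() {
    assert_eq!(queue_name(&JobType::DataSubmission), "da_queue");
}
```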

Contributor

@notlesh notlesh left a comment

Looks good for the most part, I left a lot of comments but I don't have too much context so not all of it is relevant.

@raynaudoe
Contributor

Hi @heemankv, can you please check the failing tests?

Contributor

@HermanObst HermanObst left a comment

@heemankv Some preliminary comments, Ill continue with the rest of the PR :)


// Create the StateUpdate-specific metadata
let state_update_metadata = StateUpdateMetadata {
blocks_to_settle: vec![block_number],
Contributor

For now this can only be 1 block, right?
I guess this is in preparation for Applicative recursion?

Contributor Author

@heemankv heemankv Apr 11, 2025

This is a helper function that adds the state-update job for block - 1 to the db, where block is the one being tested in e2e. This ensures that the job is created by the orch's state-update logic.

Contributor Author

@heemankv heemankv Apr 11, 2025

In preparation for Applicative recursion

No, this is not for AR. Rather, we decided to do the state update for 10 blocks (an arbitrary number) together, just to save on the overhead of sending a job to the queue and waiting for it to be picked up. Since there's no way for us to do state updates in parallel (due to the nonce issue), we manually manage the nonce while sending up to 10 state updates together.

@@ -357,13 +378,26 @@ pub async fn mock_proving_job_endpoint_output(sharp_client: &mut SharpClient) {

/// Puts after SNOS job state into the database
pub async fn put_job_data_in_db_da(mongo_db: &MongoDbServer, l2_block_number: String) {
// Create the DA-specific metadata
let da_metadata = DaMetadata {
block_number: l2_block_number.parse::<u64>().unwrap() - 1,
Contributor

Why is this -1?

Contributor

Also, it seems to be the same as the internal_id

Contributor

I would make sure this can't underflow (do a checked_sub() or saturating_sub()).

Contributor Author

This is a helper function that adds the DA job for block - 1 to the db, where block is the one being tested in e2e. This is a brute-force way to ensure that the job is created by the orch's DA logic.
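The underflow guard suggested above could look like this (a sketch under the assumption that clamping at block 0 is acceptable; the helper name and the simplified parse handling are hypothetical):

```rust
// Sketch of the suggested guard: saturating_sub clamps at 0 instead of
// panicking in debug builds (or wrapping around in release builds).
fn previous_block(l2_block_number: &str) -> u64 {
    l2_block_number
        .parse::<u64>()
        .unwrap_or(0) // parse-failure handling is simplified for the sketch
        .saturating_sub(1)
}

fn main() {
    assert_eq!(previous_block("5"), 4);
    assert_eq!(previous_block("0"), 0); // the genesis block cannot underflow
}
```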

Comment on lines 17 to 20
pub struct JobProcessingState {
pub semaphore: Semaphore,
pub active_jobs: Mutex<HashSet<Uuid>>,
}
Contributor

@heemankv do we need a Mutex to protect HashSet<Uuid>?
This will enforce that only one thread can write or read.

Maybe you just need a RwLock, which allows many threads to read the data (but only one to write).

Contributor Author

Update: in the latest code, we've removed active_jobs since it was not being used.
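Even though active_jobs was ultimately removed, the RwLock suggestion can be sketched as follows (std::sync::RwLock and u64 ids keep the example self-contained; the async orchestrator would use tokio::sync::RwLock and Uuid instead): many readers may hold the lock concurrently, while a writer gets exclusive access.

```rust
use std::collections::HashSet;
use std::sync::RwLock;

// Sketch of the reviewer's suggestion, not the PR's code.
struct JobProcessingState {
    active_jobs: RwLock<HashSet<u64>>,
}

impl JobProcessingState {
    fn get_active_jobs(&self) -> HashSet<u64> {
        // Shared read lock: many threads may read concurrently.
        self.active_jobs.read().unwrap().clone()
    }

    fn add_job(&self, id: u64) {
        // Exclusive write lock: only one writer at a time.
        self.active_jobs.write().unwrap().insert(id);
    }
}

fn main() {
    let state = JobProcessingState { active_jobs: RwLock::new(HashSet::new()) };
    state.add_job(1);
    assert!(state.get_active_jobs().contains(&1));
}
```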

}

pub async fn get_active_jobs(&self) -> HashSet<Uuid> {
self.active_jobs.lock().await.clone()
Contributor

I was trying to understand the use case (to see if we really want only one reader), but it seems this function is never called.

@@ -344,14 +392,16 @@ pub async fn state_update_to_blob_data(
}

/// To store the blob data using the storage client with path <block_number>/blob_data.txt
Contributor

Is this still true?

Member

Not as of now; paths are provided by the worker itself, so ideally this doc comment shouldn't be here anymore.

Contributor Author

Removing this comment since now paths are provided by the worker itself.

#[serde(with = "chrono::serde::ts_seconds_option")]
pub verification_completed_at: Option<DateTime<Utc>>,
/// Reason for job failure if any
pub failure_reason: Option<String>,
Contributor

Do we know the possible errors?
If we do, it always makes sense to include those types in the type system, rather than a concatenation of chars

Member

I agree, and in the refactor we have included error types, it would be part of the refactor effort

Comment on lines +40 to +60
/// Macro to implement TryInto for JobSpecificMetadata variants
macro_rules! impl_try_into_metadata {
($variant:ident, $type:ident) => {
impl TryInto<$type> for JobSpecificMetadata {
type Error = eyre::Error;

fn try_into(self) -> Result<$type, Self::Error> {
match self {
JobSpecificMetadata::$variant(metadata) => Ok(metadata),
_ => Err(eyre!(concat!("Invalid metadata type: expected ", stringify!($variant), " metadata"))),
}
}
}
};
}

// Implement TryInto for all metadata types
impl_try_into_metadata!(Snos, SnosMetadata);
impl_try_into_metadata!(Proving, ProvingMetadata);
impl_try_into_metadata!(Da, DaMetadata);
impl_try_into_metadata!(StateUpdate, StateUpdateMetadata);
Contributor

I think implementing TryInto for every type explicitly is much clearer than a macro. Still, it's just my opinion.

Contributor

Anyway, could we add a unit test for this to showcase usage (and ensure correctness)?
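A unit-test sketch of the macro's usage (self-contained rather than the PR's code: eyre is swapped for a plain String error, and only two variants are shown):

```rust
// Simplified stand-ins for the PR's metadata types.
#[derive(Debug, Clone)]
struct SnosMetadata { block_number: u64 }
#[derive(Debug, Clone)]
struct DaMetadata { block_number: u64 }

#[derive(Debug, Clone)]
enum JobSpecificMetadata {
    Snos(SnosMetadata),
    Da(DaMetadata),
}

// Same shape as the PR's macro, with a String error instead of eyre.
macro_rules! impl_try_into_metadata {
    ($variant:ident, $type:ident) => {
        impl TryInto<$type> for JobSpecificMetadata {
            type Error = String;

            fn try_into(self) -> Result<$type, Self::Error> {
                match self {
                    JobSpecificMetadata::$variant(metadata) => Ok(metadata),
                    _ => Err(concat!("Invalid metadata type: expected ", stringify!($variant), " metadata").to_string()),
                }
            }
        }
    };
}

impl_try_into_metadata!(Snos, SnosMetadata);
impl_try_into_metadata!(Da, DaMetadata);

fn main() {
    let meta = JobSpecificMetadata::Snos(SnosMetadata { block_number: 7 });

    // Matching variant converts successfully.
    let snos: Result<SnosMetadata, _> = meta.clone().try_into();
    assert_eq!(snos.unwrap().block_number, 7);

    // Mismatched variant is rejected with a descriptive error.
    let da: Result<DaMetadata, _> = meta.try_into();
    assert!(da.unwrap_err().contains("expected Da"));
}
```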

Comment on lines 6 to 12
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub enum ProvingInputType {
/// Path to an existing proof
Proof(String),
/// Path to a Cairo PIE file
CairoPie(String),
}
Contributor

This should be named ProvingInputTypePath, right? This enum holds a path, and the typed input is still not created.
(Actually, we are "trusting" that this path points to some data that can then be converted into a meaningful type.)

Contributor Author

Valid, I've changed it to ProvingInputTypePath

/// Block number to prove
pub block_number: u64,
/// Path to the input file (proof or Cairo PIE)
pub input_path: Option<ProvingInputType>,
Contributor

How does this relate to ProvingInputType?

Could you add some docs explaining it? (e.g. when it is a proof, when a Cairo PIE, whether they change during the job's lifetime, when it can be None, etc.)

Contributor Author

I believe this impl is a little ahead of itself. ProvingInputTypePath will be heavily used in L3 support, given we'll send proofs instead of the PIE for the bridge proofs, while sending the Cairo PIE for the first proof.

Although I'd like to follow YAGNI, given it's already written and we know we're going to use it, we might as well keep it. TBH anything works for me; wdyt?

Contributor

@HermanObst HermanObst left a comment

second run

/// * `metadata` - Additional key-value pairs associated with the job
///
/// # Returns
/// * `Result<JobItem, JobError>` - The created job item or an error
async fn create_job(
Contributor

I got confused with the create_job function that actually creates the job in the db, etc.
Maybe this one should be named create_job_item? And the docs could note that the different impls of this function are called by create_job depending on the job_type we want to create.

Contributor Author

I believe we can change the db function from create_job to create_job_item.
Sounds fine ? WDYT ?

Contributor

Agreed!

/// # Returns
/// * `Result<(), JobError>` - Success or an error
///
/// # State Transitions
Contributor

Suggested change
/// # State Transitions
/// # State Transitions depending Job state while entering this function

Contributor

Or something like that xD

Contributor Author

@heemankv heemankv Apr 11, 2025

I am unsure about what you are looking for here, would appreciate elaboration.
Thanks!

/// # Returns
/// * `Result<(), JobError>` - Success or an error
///
/// # State Transitions
Contributor

Suggested change
/// # State Transitions
/// # State Transitions depending Job state while entering this function

Contributor Author

Same as above.

I am unsure about what you are looking for here, would appreciate elaboration.
Thanks!

Comment on lines 17 to 23
pub process_attempt_no: u64,
/// Number of times the job has been retried after processing failures
pub process_retry_attempt_no: u64,
/// Number of times the job has been verified
pub verification_attempt_no: u64,
/// Number of times the job has been retried after verification failures
pub verification_retry_attempt_no: u64,
Contributor

Just being annoying here: u64 doesn't make sense for this :P
I would say u16 is more than sufficient.

Contributor Author

@heemankv heemankv Apr 11, 2025

Valid argument, changed.

#[serde(with = "chrono::serde::ts_seconds_option")]
pub verification_completed_at: Option<DateTime<Utc>>,
/// Reason for job failure if any
pub failure_reason: Option<String>,
Contributor

Maybe not for this PR, but maybe it makes sense for this to be an Option<Vec<String>>?
I don't know if it's relevant for us, but that way we won't lose earlier failure reasons and would have a log of them.

cc @Mohiiit @heemankv

Contributor Author

This is a valid argument, adding an issue to track this : #575

Contributor

@HermanObst HermanObst left a comment

As a general comment I think the current implementation has scattered state transition logic, making it hard to reason about the full lifecycle. We should consider refactoring into a centralized state machine to improve clarity and testability.

I left some more comments, please answer :)

Moving from that, this changes LGTM

/// * `metadata` - Additional key-value pairs associated with the job
///
/// # Returns
/// * `Result<JobItem, JobError>` - The created job item or an error
async fn create_job(
Contributor

Agreed!

Comment on lines +322 to +324
#[case(vec![651052, 651054, 651051, 651056], "numbers aren't sorted in increasing order")]
#[case(vec![651052, 651052, 651052, 651052], "Duplicated block numbers")]
#[case(vec![651052, 651052, 651053, 651053], "Duplicated block numbers")]
Contributor

@heemankv so if the block numbers are not ordered the system fails?
Is a way to recover?

@raynaudoe

Contributor Author

@heemankv heemankv Apr 15, 2025

Yeah.
No, there's no way to recover from it as of now, since we expect the values to be sorted, as they are processed in that order.

// Set input path as CairoPie type
input_path: snos_metadata.cairo_pie_path.map(ProvingInputTypePath::CairoPie),
// Set download path if needed
download_proof: None,
Contributor

This will be always None?
@heemankv

Contributor Author

Seems like this change is also ahead of itself; it is heavily used in the L3 PR we'll be raising soon.

)
.await?,
Some(last_block_processed_in_last_job),
Some(last_block_processed),
)
}
None => {
Contributor

@heemankv how can this happen?

Contributor Author

Would need more context on what you are asking. 😅

@Mohiiit Mohiiit merged commit a99088a into main Apr 16, 2025
21 of 26 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in Madara Apr 16, 2025
@heemankv heemankv deleted the rebase/orchestrator-272b013a branch May 22, 2025 19:45
Labels
feature Request for new feature or enhancement Project Orch
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants