Skip to content

Feat: Allow specifying a minimum number of intervals to include for each model in a plan #4780

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 4, 2025

Conversation

erindru
Copy link
Collaborator

@erindru erindru commented Jun 22, 2025

This addresses part of issue #4069 , albeit in a slightly different way to what is described in the ticket.

This PR adds a new plan option, --min-intervals, intended to be used like so:

sqlmesh plan dev --start '2 weeks ago' --min-intervals 1

What this option does is ensure that all models in the new environment have at least 1 interval backfilled, even if their interval unit is larger than the relative time period specified for --start.

It does this by allowing a list of per-model start date overrides to be supplied to a plan (similar to the existing interval_end_per_model argument). If there is a start date override available for a given snapshot, it gets used, otherwise the plan start date gets used.

Thus, --min-intervals is implemented in terms of calculating the earliest start date that would be needed to cover --min-intervals intervals. If this calculated date is earlier than the plan start date, it is added to the start date overrides.

The start date overrides are used by:

  • DeployabilityIndex.create() to ensure that the adjusted per-model start date still results in deployable data
  • missing_intervals() to override the start date that is given to Snapshot.missing_intervals() to return intervals that can be outside the default plan bounds

The immediate use-case is for PR environments created by the CI/CD bot which would allow you to say things like:

  • "always create PR envs with 2 weeks worth of data in them" and still have this include monthly models
  • "always create PR envs with 1 days worth of data in them" and still have this include weekly and monthly models

Right now these are excluded which can result in downstream daily models missing data in PR envs.

This option could be extended in future to also specify the minimum number of intervals to cover for dev previews

@erindru erindru force-pushed the erin/pr-min-intervals branch 3 times, most recently from 8db1a48 to 237c765 Compare June 23, 2025 22:33
@erindru erindru changed the title Feat: Allow specifying a minimum number of intervals to include for dev plans with a relative start date Feat: Allow specifying a minimum number of intervals to include for dev plans Jun 23, 2025
@click.option(
"--min-intervals",
default=0,
help="In new environments created against a specific time range, ensure that models contain at least this many intervals",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this only impacts new environments, does it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was meant to say dev environments (there was a check that threw an error if you specified this on prod).

I've updated the text and removed the dev environment check because specifying this on prod is harmless in the sense that it doesnt do anything because prod doesnt support --start and --end so already considers the full time range

@erindru erindru force-pushed the erin/pr-min-intervals branch from 237c765 to 718ea0f Compare June 27, 2025 01:46
@erindru erindru changed the title Feat: Allow specifying a minimum number of intervals to include for dev plans Feat: Allow specifying a minimum number of intervals to check during missing_intervals() Jun 27, 2025
@erindru erindru force-pushed the erin/pr-min-intervals branch from 718ea0f to 0ed5945 Compare June 30, 2025 04:31
@erindru erindru changed the title Feat: Allow specifying a minimum number of intervals to check during missing_intervals() Feat: Allow specifying a minimum number of intervals to include for each model in a plan Jun 30, 2025
@erindru erindru force-pushed the erin/pr-min-intervals branch from 0ed5945 to 3a1a7c3 Compare June 30, 2025 04:39
if not snapshot:
continue

starting_point = plan_end_dt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we use interval_end_per_model here instead of a global end?

Copy link
Collaborator Author

@erindru erindru Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's an explicit goal not to backfill anything beyond what exists in prod, yes.

I've updated this

@@ -1553,6 +1573,7 @@ def plan_builder(
),
end_bounded=not run,
ensure_finalized_snapshots=self.config.plan.use_finalized_state,
start_override_per_model=start_override_per_model,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the name be consistent with interval_end_per_model? I don't care which one is it, but I feel like they represent similar thing and should be named similarly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed interval_end_per_model to end_override_per_model on the Plan to match

snapshot_start_date = start_dt

snapshot_start_override = start_override_per_model.get(snapshot.name, None)
snapshot_start_date = snapshot_start_override or start_dt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

snapshot_start_date = start_override_per_model.get(snapshot.name, start_dt)

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, updated

@erindru erindru force-pushed the erin/pr-min-intervals branch 4 times, most recently from 02bbb5f to 9c5dfc2 Compare July 3, 2025 04:07
# for example, A(hourly) <- B(daily)
# if min_intervals=1, A would have 1 hour and B would have 1 day
# but B depends on A so in order for B to have 1 valid day, A needs to be expanded to 24 hours
backfill_dag = self.dag.prune(*backfill_model_fqns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we can reuse this DAG? Wouldn't the loaded DAG be different if they use a selector?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was fine due to the pruning, but i've updated the code to construct a new DAG

]

# start from the leaf nodes and work back towards the root because the min_start at the root node is determined by the calculated starts in the leaf nodes
for subdag in reversed_subdags:
Copy link
Member

@izeigerman izeigerman Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this contain overlapping subdags? Why can't we reverse the whole DAG and just traverse it in one go? So something like:

reversed_dag = dag.reversed
for model_fqn in reversed_dag:
    snapshot = snapshots_by_model_fqn[model_fqn]
    
    # Get the minimum start from all immediate children of this snapshot
    min_child_start = min([
        start_overrides.get(fqn, sys.max) 
        for fqn in reversed_dag.get(model_fqn, set())
    ])
    
    # Proceed with computing the start for this snapshot and taking a min of computed start and min_child_start
    
    ```
    

Copy link
Collaborator Author

@erindru erindru Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the key difference here is only checking immediate children which I missed when scanning the DAG API originally (I thought it had to be all downstream nodes).

I've adjusted it as per your suggestion and also set an override whether it's needed or not so there is always a value for each node in the start_overrides dict

@erindru erindru force-pushed the erin/pr-min-intervals branch from 9c5dfc2 to 1e56590 Compare July 3, 2025 21:59
backfill_models: t.Optional[t.Set[str]],
modified_model_names: t.Set[str],
execution_time: t.Optional[TimeLike] = None,
) -> t.Tuple[t.Optional[int], t.Optional[int]]:
if not max_interval_end_per_model:
return None, None

default_end = max(max_interval_end_per_model.values())
default_end = to_timestamp(max(max_interval_end_per_model.values()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, why did we change this to datetime only to convert back to timestamp later?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rome wasnt built in a day and the rest of the code in that method was ints.

One day I hope we will use proper types internally and push TimeLike and co back to the edges / user input handling only

@erindru erindru merged commit 61455f2 into main Jul 4, 2025
27 checks passed
@erindru erindru deleted the erin/pr-min-intervals branch July 4, 2025 02:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants