-
Notifications
You must be signed in to change notification settings - Fork 238
Feat: Allow specifying a minimum number of intervals to include for each model in a plan #4780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8db1a48
to
237c765
Compare
sqlmesh/cli/main.py
Outdated
@click.option( | ||
"--min-intervals", | ||
default=0, | ||
help="In new environments created against a specific time range, ensure that models contain at least this many intervals", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this only impacts new environments, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was meant to say dev environments (there was a check that threw an error if you specified this on prod).
I've updated the text and removed the dev environment check because specifying this on prod
is harmless in the sense that it doesnt do anything because prod
doesnt support --start
and --end
so already considers the full time range
237c765
to
718ea0f
Compare
718ea0f
to
0ed5945
Compare
0ed5945
to
3a1a7c3
Compare
sqlmesh/core/context.py
Outdated
if not snapshot: | ||
continue | ||
|
||
starting_point = plan_end_dt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we use interval_end_per_model
here instead of a global end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's an explicit goal not to backfill anything beyond what exists in prod, yes.
I've updated this
@@ -1553,6 +1573,7 @@ def plan_builder( | |||
), | |||
end_bounded=not run, | |||
ensure_finalized_snapshots=self.config.plan.use_finalized_state, | |||
start_override_per_model=start_override_per_model, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the name be consistent with interval_end_per_model
? I don't care which one is it, but I feel like they represent similar thing and should be named similarly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've renamed interval_end_per_model
to end_override_per_model
on the Plan to match
sqlmesh/core/snapshot/definition.py
Outdated
snapshot_start_date = start_dt | ||
|
||
snapshot_start_override = start_override_per_model.get(snapshot.name, None) | ||
snapshot_start_date = snapshot_start_override or start_dt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
snapshot_start_date = start_override_per_model.get(snapshot.name, start_dt)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, updated
02bbb5f
to
9c5dfc2
Compare
sqlmesh/core/context.py
Outdated
# for example, A(hourly) <- B(daily) | ||
# if min_intervals=1, A would have 1 hour and B would have 1 day | ||
# but B depends on A so in order for B to have 1 valid day, A needs to be expanded to 24 hours | ||
backfill_dag = self.dag.prune(*backfill_model_fqns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we can reuse this DAG? Wouldn't the loaded DAG be different if they use a selector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was fine due to the pruning, but i've updated the code to construct a new DAG
sqlmesh/core/context.py
Outdated
] | ||
|
||
# start from the leaf nodes and work back towards the root because the min_start at the root node is determined by the calculated starts in the leaf nodes | ||
for subdag in reversed_subdags: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this contain overlapping subdags? Why can't we reverse the whole DAG and just traverse it in one go? So something like:
reversed_dag = dag.reversed
for model_fqn in reversed_dag:
snapshot = snapshots_by_model_fqn[model_fqn]
# Get the minimum start from all immediate children of this snapshot
min_child_start = min([
start_overrides.get(fqn, sys.max)
for fqn in reversed_dag.get(model_fqn, set())
])
# Proceed with computing the start for this snapshot and taking a min of computed start and min_child_start
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, the key difference here is only checking immediate children which I missed when scanning the DAG API originally (I thought it had to be all downstream nodes).
I've adjusted it as per your suggestion and also set an override whether it's needed or not so there is always a value for each node in the start_overrides
dict
…ach model in a plan
…ld start date override
9c5dfc2
to
1e56590
Compare
backfill_models: t.Optional[t.Set[str]], | ||
modified_model_names: t.Set[str], | ||
execution_time: t.Optional[TimeLike] = None, | ||
) -> t.Tuple[t.Optional[int], t.Optional[int]]: | ||
if not max_interval_end_per_model: | ||
return None, None | ||
|
||
default_end = max(max_interval_end_per_model.values()) | ||
default_end = to_timestamp(max(max_interval_end_per_model.values())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, why did we change this to datetime only to convert back to timestamp later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rome wasnt built in a day and the rest of the code in that method was ints.
One day I hope we will use proper types internally and push TimeLike
and co back to the edges / user input handling only
This addresses part of issue #4069 , albeit in a slightly different way to what is described in the ticket.
This PR adds a new plan option,
--min-intervals
, intended to be used like so:What this option does is ensure that all models in the new environment have at least 1 interval backfilled, even if their interval unit is larger than the relative time period specified for
--start
.It does this by allowing a list of per-model start date overrides to be supplied to a plan (similar to the existing
interval_end_per_model
argument). If there is a start date override available for a given snapshot, it gets used, otherwise the plan start date gets used.Thus,
--min-intervals
is implemented in terms of calculating the earliest start date that would be needed to cover--min-intervals
intervals. If this calculated date is earlier than the plan start date, it is added to the start date overrides.The start date overrides are used by:
DeployabilityIndex.create()
to ensure that the adjusted per-model start date still results in deployable datamissing_intervals()
to override the start date that is given toSnapshot.missing_intervals()
to return intervals that can be outside the default plan boundsThe immediate use-case is for PR environments created by the CI/CD bot which would allow you to say things like:
Right now these are excluded which can result in downstream daily models missing data in PR envs.
This option could be extended in future to also specify the minimum number of intervals to cover for dev previews