-
Notifications
You must be signed in to change notification settings - Fork 0
Branch versioning logic #163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
76d15ee
ad0c7c4
4beeee3
866e5c6
8f1145d
2fa86a4
a77ee24
1f14c98
35fe7ea
ce411b4
9c58f34
de72d1f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,290 @@ | ||
|
||
|
||
# Branch support | ||
|
||
We want to support "branches" that allow users to run pipelines and services with two | ||
critical goals: | ||
|
||
- don't pollute production datasets and end-points | ||
- are cheap, by re-using as much of existing datasets as possible | ||
|
||
## Scenarios within experimentation | ||
|
||
While developing on a branch - users could | ||
- make semantic updates - that change output data - eg., logic within where-s, selects, aggregations etc | ||
- make non-semantic updates - that don't change output data - eg., spark exec memory, # of pods / workers etc | ||
- *note that everything that is stored in metadata field within our API's are non-semantic fields* | ||
- add new nodes | ||
- delete existing nodes | ||
- make changes to several compute nodes at once | ||
- decide to merge the branch into master | ||
|
||
With this context, the goal of this document is to develop / describe a representation to handle the above user workflows. | ||
|
||
|
||
## Motivating Example | ||
|
||
Legend: | ||
``` | ||
"sq" stands for StagingQuery "j" for Join | ||
"t" stands for table "m" for Model | ||
"gb" for GroupBy | ||
``` | ||
|
||
Nodes will be numbered - `gb4`, `m2` etc | ||
|
||
Semantic changes to node notated using a plus "+". | ||
Eg., Join `J3` becomes `J3+` | ||
|
||
Non-Semantic changes with an asterisk "*" - `J3*` | ||
|
||
|
||
```mermaid | ||
--- | ||
title: Initial state of the example | ||
--- | ||
|
||
graph TD; | ||
sq1-->t1; | ||
t1-->gb1; | ||
gb1-->j1; | ||
t2-->gb2; | ||
gb2-->j1; | ||
j1-->m1; | ||
gb1-->j2; | ||
``` | ||
|
||
### Semantic updates | ||
|
||
Say that, `sq1` changes semantically to `sq1+`. It is going to change the output of all | ||
nodes downstream of it. | ||
|
||
```mermaid | ||
--- | ||
title: sq1 is updated semantically | ||
--- | ||
|
||
graph TD; | ||
sq1+-->t1+; | ||
t1+-->gb1+; | ||
gb1+-->j1+; | ||
t2-->gb2; | ||
gb2-->j1+; | ||
j1+-->m1+; | ||
gb1+-->j2+; | ||
|
||
style sq1+ fill:wheat,color:black,stroke:#333 | ||
style t1+ fill:wheat,color:black,stroke:#333 | ||
style gb1+ fill:wheat,color:black,stroke:#333 | ||
style j1+ fill:wheat,color:black,stroke:#333 | ||
style j2+ fill:wheat,color:black,stroke:#333 | ||
style m1+ fill:wheat,color:black,stroke:#333 | ||
``` | ||
|
||
> A major concern here is that, if the local repository of the user is behind remote, | ||
> we will a lot more changes than the user intends to. | ||
|
||
One approach to mitigate this is to, only make the CLI only pick up changes to files listed as edited by | ||
commits to the git branch. | ||
|
||
Another approach is to force user to rebase on any change to the repo. However, this does not | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I feel like this might be the better / safer option if we're trying to reuse etc. A super outdated user repo and prod might lead us to funky / hard to debug issues I fear.. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. versioning should prevent that. I am afraid that forcing a rebase would force progress to be lost - expensive backfills, model training runs etc |
||
guarantee that changes while the job is running is accounted for. | ||
|
||
### Non Semantic updates | ||
Instead, if `sq1` changes non-semantically to `sq1-`. None of the downstream nodes would change. | ||
|
||
```mermaid | ||
--- | ||
title: sq1 is updated non-semantically | ||
--- | ||
|
||
graph TD; | ||
sq1*-->t1; | ||
t1-->gb1; | ||
gb1-->j1; | ||
t2-->gb2; | ||
gb2-->j1; | ||
j1-->m1; | ||
gb1-->j2; | ||
|
||
style sq1* fill:lavender,color:black,stroke:#333 | ||
``` | ||
|
||
Depending on who is running the job we need to decide which version of the node to use | ||
- if the branch author is causing the node to be computed we need to use `sq1*` instead of `sq1` | ||
- if the prod flow or other authors who haven't updated `sq1` are causing the compute, we should use `sq1` | ||
- if another branch is also updating `sq1` non semantically to `sq1**` we need to use that instead. | ||
|
||
### Adding new nodes | ||
|
||
Adding new leaf nodes will not impact any of the existing nodes. | ||
|
||
```mermaid | ||
--- | ||
title: m2 is added | ||
--- | ||
|
||
graph TD; | ||
sq1-->t1; | ||
t1-->gb1; | ||
gb1-->j1; | ||
t2-->gb2; | ||
gb2-->j1; | ||
j1-->m1; | ||
gb1-->j2; | ||
j2-->m2; | ||
|
||
style m2 fill:lightgreen,color:black,stroke:#333 | ||
``` | ||
|
||
|
||
But adding non-leaf node - | ||
as parent to existing node - would almost always cause semantic updates to nodes downstream. | ||
|
||
```mermaid | ||
--- | ||
title: gb3 is added | ||
--- | ||
|
||
graph TD; | ||
sq1-->t1; | ||
t1-->gb1; | ||
gb1-->j1+; | ||
t2-->gb2; | ||
t3-->gb3; | ||
gb2-->j1+; | ||
gb3-->j1+; | ||
j1+-->m1+; | ||
gb1-->j2; | ||
|
||
style t3 fill:lightgreen,color:black,stroke:#333 | ||
style gb3 fill:lightgreen,color:black,stroke:#333 | ||
style j1+ fill:wheat,color:black,stroke:#333 | ||
style m1+ fill:wheat,color:black,stroke:#333 | ||
``` | ||
|
||
One interesting case here is migrating the sql from an external system to StagingQuery of an | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. but we can't guarantee the external dataset and the staging query are semantically equivalent right? The output schemas would line up but there might be underlying differences in the datasets (e.g. an extra predicate or two) |
||
already used table. Even though this is not a leaf node, absorbing it as same as a leaf node change | ||
would be the right thing to do. | ||
|
||
|
||
```mermaid | ||
--- | ||
title: sq2 is added | ||
--- | ||
|
||
graph TD; | ||
sq1-->t1; | ||
t1-->gb1; | ||
gb1-->j1; | ||
t2-->gb2; | ||
gb2-->j1; | ||
j1-->m1; | ||
gb1-->j2; | ||
sq2-->t2; | ||
|
||
style sq2 fill:lightgreen,color:black,stroke:#333 | ||
``` | ||
|
||
### Deleting existing nodes | ||
|
||
Deleting leaf nodes is straight forward. We just need to program a cleanup mechanism | ||
to remove data and pipelines generated by that node. | ||
|
||
```mermaid | ||
--- | ||
title: m1 is deleted | ||
--- | ||
|
||
graph TD; | ||
sq1-->t1; | ||
t1-->gb1; | ||
gb1-->j1; | ||
t2-->gb2; | ||
gb2-->j1; | ||
j1-->m1; | ||
gb1-->j2; | ||
sq2-->t2; | ||
|
||
style m1 fill:coral,color:black,stroke:#333 | ||
``` | ||
|
||
Indirectly connected components - via table references - shouldn't be allowed to be deleted | ||
as long as there are nodes that depend on the table. We will fail this during the sync step. | ||
|
||
```mermaid | ||
--- | ||
title: sq1 is deleted (not-allowed) | ||
--- | ||
|
||
graph TD; | ||
sq1-->t1; | ||
t1-->gb1; | ||
gb1-->j1; | ||
t2-->gb2; | ||
gb2-->j1; | ||
j1-->m1; | ||
gb1-->j2; | ||
sq2-->t2; | ||
|
||
style sq1 fill:coral,color:black,stroke:#333 | ||
``` | ||
|
||
Directly connected parents when deleted will have updates in the child node - or the compilation | ||
would fail. In these cases it would be ideal to garbage collect upstream chain of the deleted node. | ||
|
||
```mermaid | ||
--- | ||
title: gb2 is deleted, j1 is updated | ||
--- | ||
|
||
graph TD; | ||
sq2-->t2; | ||
sq1-->t1; | ||
t1-->gb1; | ||
gb1-->j1+; | ||
t2-->gb2; | ||
gb2-->j1+ | ||
j1+-->m1+; | ||
gb1-->j2; | ||
|
||
style sq2 fill:coral,color:black,stroke:#333 | ||
style t2 fill:coral,color:black,stroke:#333 | ||
style gb2 fill:coral,color:black,stroke:#333 | ||
style j1+ fill:wheat,color:black,stroke:#333 | ||
style m1+ fill:wheat,color:black,stroke:#333 | ||
``` | ||
|
||
### Isolating the changed assets | ||
|
||
While development on a branch is in progress we need to create temporary data assets for | ||
the semantically changed nodes - shown in yellow above. Adds, Deletes & Semantic Updates could | ||
trigger this flow. | ||
|
||
|
||
#### Logic to achieve isolation | ||
Make a new copy of the conf & update the name (file name & metadata.name) - | ||
|
||
`new_name = old_name + '_' + branch_name` | ||
|
||
This needs to be followed by changing references in the downstream nodes - | ||
all tables and nodes downstream will have the branch suffix. | ||
|
||
``` | ||
# 1. cli sends file_hash_map to remote | ||
|
||
local_file_map = repo.compiled.file_hash_map | ||
remote_file_map = remote.file_map | ||
deleted = remote_file_map - local_file_map | ||
added = local_file_map - remote_file_map | ||
updated = [k in intersect(local_file_map, remote_file_map)] | ||
# 2. remote marks the changed files it needs | ||
|
||
(node, lineage_hash) => | ||
``` | ||
|
||
### Merging changes into `main` | ||
|
||
- Deletes should actually trigger asset and pipeline clean up. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. some of these we might want to consider 'soft' deletes (mark for hard delete n days down the line) to guard against users causing prod outages (this is not strictly our problem but it's a nice guard rail) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i agree! |
||
- Updates should trigger asset renaming | ||
- Adds can work as is - we are not suffixing adds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A q - wdyt about supporting environments (e.g. qa / preprod / ..). In my exp at Stripe and other places you might have a branch say named master but the datasets and topics in 'qa' would differ from 'prod'. Would we model envs differently or map them to branches?
(Can discuss this async as well - thought of it so wanted to flag)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am modeling this more along the lines of sqlmesh - one prod data environment in which branches operate.
ml needs full scale parallel end-points so i was treating that as prio. Env's can be added on later also.