Skip to content

Branch versioning logic #163

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jan 10, 2025
14 changes: 1 addition & 13 deletions api/thrift/orchestration.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -82,18 +82,6 @@ struct NodeInfo {
30: optional LogicalNode conf
}


/** First Pass
* NodeInstance::(name, type, conf_hash) -> #[parent_nodes]
* Node::(name, type) -> #[conf_hash]

* Second Pass
* Node::(name, type, compute_hash) -> #[parent_nodes]

* different file_hashes but same lineage_hash should all go into the same orchestrator workflow
* Node::(name, type, lineage_hash)
**/

struct NodeConnections {
1: optional list<NodeKey> parents
2: optional list<NodeKey> children
Expand Down Expand Up @@ -272,7 +260,7 @@ struct TableDependency {
* JoinParts could use data from batch backfills or upload tables when available
* When not available they shouldn't force computation of the backfills and upload tables.
**/
21: optional bool forceComputae
21: optional bool forceCompute
}

union Dependency {
Expand Down
290 changes: 290 additions & 0 deletions orchestration/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,290 @@


# Branch support

We want to support "branches" that allow users to run pipelines and services with two
critical goals:

- don't pollute production datasets and end-points
- are cheap, by re-using as much of existing datasets as possible

## Scenarios within experimentation

While developing on a branch - users could
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A q - wdyt about supporting environments (e.g. qa / preprod / ..). In my exp at Stripe and other places you might have a branch say named master but the datasets and topics in 'qa' would differ from 'prod'. Would we model envs differently or map them to branches?
(Can discuss this async as well - thought of it so wanted to flag)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am modeling this more along the lines of sqlmesh - one prod data environment in which branches operate.

ml needs full scale parallel end-points so i was treating that as prio. Env's can be added on later also.

- make semantic updates - that change output data - eg., logic within where-s, selects, aggregations etc
- make non-semantic updates - that don't change output data - eg., spark exec memory, # of pods / workers etc
- *note that everything that is stored in metadata field within our API's are non-semantic fields*
- add new nodes
- delete existing nodes
- make changes to several compute nodes at once
- decide to merge the branch into master

With this context, the goal of this document is to develop / describe a representation to handle the above user workflows.


## Motivating Example

Legend:
```
"sq" stands for StagingQuery "j" for Join
"t" stands for table "m" for Model
"gb" for GroupBy
```

Nodes will be numbered - `gb4`, `m2` etc

Semantic changes to node notated using a plus "+".
Eg., Join `J3` becomes `J3+`

Non-Semantic changes with an asterisk "*" - `J3*`


```mermaid
---
title: Initial state of the example
---

graph TD;
sq1-->t1;
t1-->gb1;
gb1-->j1;
t2-->gb2;
gb2-->j1;
j1-->m1;
gb1-->j2;
```

### Semantic updates

Say that, `sq1` changes semantically to `sq1+`. It is going to change the output of all
nodes downstream of it.

```mermaid
---
title: sq1 is updated semantically
---

graph TD;
sq1+-->t1+;
t1+-->gb1+;
gb1+-->j1+;
t2-->gb2;
gb2-->j1+;
j1+-->m1+;
gb1+-->j2+;

style sq1+ fill:wheat,color:black,stroke:#333
style t1+ fill:wheat,color:black,stroke:#333
style gb1+ fill:wheat,color:black,stroke:#333
style j1+ fill:wheat,color:black,stroke:#333
style j2+ fill:wheat,color:black,stroke:#333
style m1+ fill:wheat,color:black,stroke:#333
```

> A major concern here is that, if the local repository of the user is behind remote,
> we will a lot more changes than the user intends to.

One approach to mitigate this is to, only make the CLI only pick up changes to files listed as edited by
commits to the git branch.

Another approach is to force user to rebase on any change to the repo. However, this does not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this might be the better / safer option if we're trying to reuse etc. A super outdated user repo and prod might lead us to funky / hard to debug issues I fear..

Copy link
Contributor Author

@nikhil-zlai nikhil-zlai Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versioning should prevent that.

I am afraid that forcing a rebase would force progress to be lost - expensive backfills, model training runs etc

guarantee that changes while the job is running is accounted for.

### Non Semantic updates
Instead, if `sq1` changes non-semantically to `sq1-`. None of the downstream nodes would change.

```mermaid
---
title: sq1 is updated non-semantically
---

graph TD;
sq1*-->t1;
t1-->gb1;
gb1-->j1;
t2-->gb2;
gb2-->j1;
j1-->m1;
gb1-->j2;

style sq1* fill:lavender,color:black,stroke:#333
```

Depending on who is running the job we need to decide which version of the node to use
- if the branch author is causing the node to be computed we need to use `sq1*` instead of `sq1`
- if the prod flow or other authors who haven't updated `sq1` are causing the compute, we should use `sq1`
- if another branch is also updating `sq1` non semantically to `sq1**` we need to use that instead.

### Adding new nodes

Adding new leaf nodes will not impact any of the existing nodes.

```mermaid
---
title: m2 is added
---

graph TD;
sq1-->t1;
t1-->gb1;
gb1-->j1;
t2-->gb2;
gb2-->j1;
j1-->m1;
gb1-->j2;
j2-->m2;

style m2 fill:lightgreen,color:black,stroke:#333
```


But adding non-leaf node -
as parent to existing node - would almost always cause semantic updates to nodes downstream.

```mermaid
---
title: gb3 is added
---

graph TD;
sq1-->t1;
t1-->gb1;
gb1-->j1+;
t2-->gb2;
t3-->gb3;
gb2-->j1+;
gb3-->j1+;
j1+-->m1+;
gb1-->j2;

style t3 fill:lightgreen,color:black,stroke:#333
style gb3 fill:lightgreen,color:black,stroke:#333
style j1+ fill:wheat,color:black,stroke:#333
style m1+ fill:wheat,color:black,stroke:#333
```

One interesting case here is migrating the sql from an external system to StagingQuery of an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we can't guarantee the external dataset and the staging query are semantically equivalent right? The output schemas would line up but there might be underlying differences in the datasets (e.g. an extra predicate or two)

already used table. Even though this is not a leaf node, absorbing it as same as a leaf node change
would be the right thing to do.


```mermaid
---
title: sq2 is added
---

graph TD;
sq1-->t1;
t1-->gb1;
gb1-->j1;
t2-->gb2;
gb2-->j1;
j1-->m1;
gb1-->j2;
sq2-->t2;

style sq2 fill:lightgreen,color:black,stroke:#333
```

### Deleting existing nodes

Deleting leaf nodes is straight forward. We just need to program a cleanup mechanism
to remove data and pipelines generated by that node.

```mermaid
---
title: m1 is deleted
---

graph TD;
sq1-->t1;
t1-->gb1;
gb1-->j1;
t2-->gb2;
gb2-->j1;
j1-->m1;
gb1-->j2;
sq2-->t2;

style m1 fill:coral,color:black,stroke:#333
```

Indirectly connected components - via table references - shouldn't be allowed to be deleted
as long as there are nodes that depend on the table. We will fail this during the sync step.

```mermaid
---
title: sq1 is deleted (not-allowed)
---

graph TD;
sq1-->t1;
t1-->gb1;
gb1-->j1;
t2-->gb2;
gb2-->j1;
j1-->m1;
gb1-->j2;
sq2-->t2;

style sq1 fill:coral,color:black,stroke:#333
```

Directly connected parents when deleted will have updates in the child node - or the compilation
would fail. In these cases it would be ideal to garbage collect upstream chain of the deleted node.

```mermaid
---
title: gb2 is deleted, j1 is updated
---

graph TD;
sq2-->t2;
sq1-->t1;
t1-->gb1;
gb1-->j1+;
t2-->gb2;
gb2-->j1+
j1+-->m1+;
gb1-->j2;

style sq2 fill:coral,color:black,stroke:#333
style t2 fill:coral,color:black,stroke:#333
style gb2 fill:coral,color:black,stroke:#333
style j1+ fill:wheat,color:black,stroke:#333
style m1+ fill:wheat,color:black,stroke:#333
```

### Isolating the changed assets

While development on a branch is in progress we need to create temporary data assets for
the semantically changed nodes - shown in yellow above. Adds, Deletes & Semantic Updates could
trigger this flow.


#### Logic to achieve isolation
Make a new copy of the conf & update the name (file name & metadata.name) -

`new_name = old_name + '_' + branch_name`

This needs to be followed by changing references in the downstream nodes -
all tables and nodes downstream will have the branch suffix.

```
# 1. cli sends file_hash_map to remote

local_file_map = repo.compiled.file_hash_map
remote_file_map = remote.file_map
deleted = remote_file_map - local_file_map
added = local_file_map - remote_file_map
updated = [k in intersect(local_file_map, remote_file_map)]
# 2. remote marks the changed files it needs

(node, lineage_hash) =>
```

### Merging changes into `main`

- Deletes should actually trigger asset and pipeline clean up.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of these we might want to consider 'soft' deletes (mark for hard delete n days down the line) to guard against users causing prod outages (this is not strictly our problem but it's a nice guard rail)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree!

- Updates should trigger asset renaming
- Adds can work as is - we are not suffixing adds.
Loading
Loading