-
Notifications
You must be signed in to change notification settings - Fork 732
Add support cache for dynamic spec #6372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
popojk
wants to merge
8
commits into
flyteorg:master
Choose a base branch
from
popojk:Add-support-cache-for-dynamic-spec
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Alex Wu <[email protected]>
Signed-off-by: Alex Wu <[email protected]>
Signed-off-by: Alex Wu <[email protected]>
Code Review Agent Run Status
|
Signed-off-by: Alex Wu <[email protected]>
Code Review Agent Run Status
|
Signed-off-by: Alex Wu <[email protected]>
Signed-off-by: Alex Wu <[email protected]>
2 tasks
Code Review Agent Run Status
|
Signed-off-by: Alex Wu <[email protected]>
Signed-off-by: Alex Wu <[email protected]>
Code Review Agent Run Status
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Tracking issue
Closes #5543
Why are the changes needed?
When executing a dynamic workflow in the Flyte backend, compiling the future.pb file before sub-node execution takes some time. If an out-of-memory (OOM) error occurs during sub-node execution, re-running the dynamic workflow requires recompiling the future.pb file, consuming additional time and computational resources. To address this, we propose caching the compiled future.pb file.
The future cache mechanism in this PR stores the compiled file as artifact data in Flyte storage via the datacatalog. The cache key, TagName, is determined by the hash of the input values. The diagrams below illustrate the data flow in the backend system:

What changes were proposed in this pull request?
Please refer to bellow diagram for understanding the procedure of how future read/write:

key changes:
1.Added Mode to TaskTemplate Metadata in flyteidl to indicate whether a task is a DynamicTask.
2.Implemented logic in flytepropeller to send cache read/write requests to datacatalog.
3.Added cache read/write functionality to the storage layer in datacatalog.
How was this patch tested?
To test how much time it saved when future file cache hit, we modify this block of code to count the time spend to run taskNodeHandler.Handle for test purpose.
To mimic OOM, we will throw panic in dynamicNodeHandler. So the Dynamic workflow will be aborted after future file compiled
then we run bellow test dynamic workflow for the first time
We use the command
pyflyte run --remote dynamic.py wf --s1="dynamic" --s2="test"
The first run took around 17 ms to compile future file
Then we remove the panic throwing and re-execute the same workflow with the same input. The future cache hit and it only took 83ns to go through the taskNodeHandler.Handle code block, which means the taskNodeHandler.Handle is skipped.
Moreover, unit tests has been added to make sure cache behaviors.
Labels
Please add one or more of the following labels to categorize your PR:
This is important to improve the readability of release notes.
Check all the applicable boxes
Related PRs
flytekit PR