adapters/kfp: support distributed training #109
Closed
This adds a new `resource_to_app` KFP adapter that adapts an app to a KFP ResourceOp that launches the operator using the Volcano scheduler. It reuses the same code that creates the resources for the Kubernetes scheduler and embeds the resource inside a KFP pipeline. This isn't supported under KFP v2 since it interacts directly with Kubernetes resources and Volcano. It also requires Volcano to be installed on the cluster, which is why it's a new adapter instead of being used automatically.
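To make the shape of the embedded resource concrete, here is a minimal sketch of a Volcano Job manifest of the kind that could be handed to a KFP v1 ResourceOp. It is not the code from this PR: the helper name `volcano_job_manifest`, the image, the command, and the queue name are hypothetical, and the manifest only covers the fields needed for gang scheduling.

```python
from typing import Any, Dict, List


def volcano_job_manifest(
    name: str,
    image: str,
    command: List[str],
    replicas: int,
    queue: str = "default",  # assumed queue name; clusters may use another
) -> Dict[str, Any]:
    """Build a minimal Volcano Job manifest with a single task group.

    Hypothetical helper for illustration only, not the adapter in this PR.
    """
    container = {"name": "worker", "image": image, "command": command}
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"generateName": f"{name}-"},
        "spec": {
            "schedulerName": "volcano",
            "queue": queue,
            # Gang scheduling: start only when all replicas can be placed.
            "minAvailable": replicas,
            "tasks": [
                {
                    "name": "worker",
                    "replicas": replicas,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [container],
                        }
                    },
                }
            ],
        },
    }


manifest = volcano_job_manifest(
    name="trainer",
    image="example.com/trainer:latest",  # placeholder image
    command=["python", "train.py"],
    replicas=2,
)

# With the KFP v1 SDK installed, a manifest like this could be embedded in a
# pipeline roughly as follows (sketch, not executed here):
#
#   from kfp import dsl
#   dsl.ResourceOp(
#       name="trainer",
#       k8s_resource=manifest,
#       action="create",
#       success_condition="status.state.phase = Completed",
#       failure_condition="status.state.phase = Failed",
#   )
```

The success and failure conditions in the comment assume the Volcano Job reports its phase under `status.state.phase`; adjust them if the installed Volcano version reports status differently.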
This is still fairly experimental, and once KFP has better distributed support we will likely want to rely on that instead, since the UX here is less than ideal: you need to use the CLI to access the individual worker logs, and there isn't any support for UI metadata yet.
I think UI metadata can be added by providing an output annotation for Argo as part of the resource, but I haven't looked into it.
Test plan:
http://5ab6bab9-istiosystem-istio-2af2-1926929629.us-west-2.elb.amazonaws.com/_/pipeline/#/runs/details/27707de9-bc67-42da-ab86-af2127ee54d1