adapters/kfp: support distributed training #109


Closed

d4l3k
Member

@d4l3k d4l3k commented Jul 26, 2021

This adds a new `resource_to_app` KFP adapter that adapts an app to a KFP ResourceOp, which launches the app through the Volcano scheduler. It reuses the code that creates the resources for the Kubernetes scheduler and embeds the resulting resource inside a KFP pipeline.
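To make the shape of this concrete, here is a minimal sketch of the idea, not the adapter's actual code. It builds a Volcano `Job` manifest as a plain dict and the keyword arguments one might pass to KFP v1's `dsl.ResourceOp`. The helper names `volcano_job_resource` and `resource_op_kwargs`, the image name, the `default` queue, and the success/failure conditions are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def volcano_job_resource(
    name: str,
    image: str,
    command: List[str],
    replicas: int,
    queue: str = "default",  # assumed queue name
) -> Dict[str, Any]:
    """Build a Volcano Job manifest with one task of `replicas` identical workers.

    Hypothetical helper for illustration; the real adapter reuses the
    Kubernetes scheduler's resource-building code instead.
    """
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            "queue": queue,
            # Gang scheduling: only start once every replica can be placed.
            "minAvailable": replicas,
            "tasks": [
                {
                    "name": "worker",
                    "replicas": replicas,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {"name": "worker", "image": image, "command": command}
                            ],
                        }
                    },
                }
            ],
        },
    }


def resource_op_kwargs(name: str, resource: Dict[str, Any]) -> Dict[str, Any]:
    """Keyword arguments for kfp.dsl.ResourceOp (KFP v1 SDK).

    The conditions below are assumptions: they tell Argo to poll the
    Volcano Job status until it reports a terminal phase.
    """
    return {
        "name": name,
        "action": "create",
        "k8s_resource": resource,
        "success_condition": "status.state.phase = Completed",
        "failure_condition": "status.state.phase = Failed",
    }


resource = volcano_job_resource(
    name="trainer",
    image="example.com/trainer:latest",  # placeholder image
    command=["python", "train.py"],
    replicas=4,
)
kwargs = resource_op_kwargs("trainer", resource)
```

Inside a KFP v1 pipeline function, the step would then be created with something like `dsl.ResourceOp(**kwargs)`. Because the op submits a raw Kubernetes resource, it only works on clusters where the Volcano CRDs are installed.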

This isn't supported under KFP v2 because it interacts directly with Kubernetes resources and Volcano. It also requires Volcano to be installed on the cluster, which is why it's a separate adapter rather than being used automatically.

This is still fairly experimental. Once KFP has better distributed support we'll likely want to rely on that instead, since the UX here is less than ideal: you need to use the CLI to access the individual worker logs, and there isn't any support for UI metadata yet.

I think UI metadata can be added by providing an output annotation for Argo as part of the resource, but I haven't looked into it.

Test plan:

pyre
pytest
python dist_pipeline.py

http://5ab6bab9-istiosystem-istio-2af2-1926929629.us-west-2.elb.amazonaws.com/_/pipeline/#/runs/details/27707de9-bc67-42da-ab86-af2127ee54d1

Screenshot: 20210726_14h12m04s_grim

@facebook-github-bot facebook-github-bot added the CLA Signed label Jul 26, 2021
@facebook-github-bot
Contributor

@d4l3k has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary:
This adds a new `resource_to_app` KFP adapter that adapts an app to a KFP ResourceOp, which launches the app through the Volcano scheduler. It reuses the code that creates the resources for the Kubernetes scheduler and embeds the resulting resource inside a KFP pipeline.

This isn't supported under KFP v2 because it interacts directly with Kubernetes resources and Volcano. It also requires Volcano to be installed on the cluster, which is why it's a separate adapter rather than being used automatically.

This is still fairly experimental. Once KFP has better distributed support we'll likely want to rely on that instead, since the UX here is less than ideal: you need to use the CLI to access the individual worker logs, and there isn't any support for UI metadata yet.

I think UI metadata can be added by providing an output annotation for Argo as part of the resource, but I haven't looked into it.

Pull Request resolved: #109

Test Plan:
```
pyre
pytest
python dist_pipeline.py
```
http://5ab6bab9-istiosystem-istio-2af2-1926929629.us-west-2.elb.amazonaws.com/_/pipeline/#/runs/details/27707de9-bc67-42da-ab86-af2127ee54d1

![20210726_14h12m04s_grim](https://user-images.githubusercontent.com/909104/127059928-b4787429-e895-4b97-b53e-c6262e99c52b.png)

Reviewed By: kiukchung

Differential Revision: D29921246

Pulled By: d4l3k

fbshipit-source-id: b23c8ea376cb25b4b6fa3e7208c120ec783d750a
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D29921246

@facebook-github-bot
Contributor

@d4l3k merged this pull request in f6907e8.
