Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow any GPU resource types support #2973

Closed
wants to merge 15 commits into from

Conversation

typhoonzero
Copy link
Contributor

@typhoonzero typhoonzero commented Oct 20, 2022

What changes were proposed in this pull request?

The K8s cluster used for GPU training may have other GPU resource types other than nvidia.com/gpu, there are other vendors like AMD or opensource vGPU implementations, like https://github.com/4paradigm/k8s-vgpu-scheduler, https://github.com/tkestack/gpu-manager etc. I propose to add support for setup user inputed resource types.
截屏2022-10-20 16 09 41

How was this pull request tested?

Tests added at test_pipeline_constructor.py

TODO:
Docs should updated in another PR.

@elyra-bot
Copy link

elyra-bot bot commented Oct 20, 2022

Thanks for making a pull request to Elyra!

To try out this branch on binder, follow this link: Binder

@akchinSTC akchinSTC self-requested a review October 20, 2022 16:38
Comment on lines 267 to 268
gpu_vendor = self.pipeline_envs.get("GPU_VENDOR", "nvidia")
self.container.set_gpu_limit(gpu=str(gpu_limit), vendor=gpu_vendor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although gpu_vendor was not a member variable previously, we should set the member. Also, for consistency, we should reference gpu_limit via self.

Suggested change
gpu_vendor = self.pipeline_envs.get("GPU_VENDOR", "nvidia")
self.container.set_gpu_limit(gpu=str(gpu_limit), vendor=gpu_vendor)
self.gpu_vendor = self.pipeline_envs.get("GPU_VENDOR", "nvidia")
self.container.set_gpu_limit(gpu=str(self.gpu_limit), vendor=self.gpu_vendor)

Comment on lines 272 to 273
if not self.gpu_memory_vendor:
raise ValueError("gpu_memory_vendor not seted while gpu_memory is specified > 0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

Suggested change
if not self.gpu_memory_vendor:
raise ValueError("gpu_memory_vendor not seted while gpu_memory is specified > 0")
if not self.gpu_memory_vendor:
raise ValueError("gpu_memory_vendor not set while gpu_memory is specified")

Question:
This PR is making gpu_memory_vendor a required value. Is there a reasonably common default (like "nvidia" is for gpu_vendor) that we can fall back to? Are the same vendor names used for gpu_limit and gpu_memory? I.e., can we use env GPU_VENDOR here as well if not provided by the user?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"nvidia" I would consider to be the default due to the market share they hold. Not sure about the vendor spec when it comes to gpu limit and memory, I always assumed these would be agnostic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question:
This PR is making gpu_memory_vendor a required value. Is there a reasonably common default (like "nvidia" is for gpu_vendor) that we can fall back to? Are the same vendor names used for gpu_limit and gpu_memory? I.e., can we use env GPU_VENDOR here as well if not provided by the user?

Thanks for reviewing!

Actually, gpu_vendor, gpu_memory and gpu_memory_vendor are ment to be optional. I'm looking for a way to make "nvidia" as default, as it is before, and make gpu_vendor, gpu_memory and gpu_memory_vendor to be optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I've tested we can leave gpu_vendor, gpu_memory and gpu_memory_vendor blank, set gpu to 1, we'll have a step with default gpu resource setup like nvidia.com/gpu: 1.

@akchinSTC akchinSTC added status:Waiting for Author component:pipeline-runtime issues related to pipeline runtimes e.g. kubeflow pipelines labels Oct 24, 2022
@typhoonzero typhoonzero changed the title [WIP] Allow any GPU resource types support Allow any GPU resource types support Oct 25, 2022
@ptitzler
Copy link
Member

Please make sure that Kubeflow Pipelines actually accepts custom values. Elyra uses ContainerOp to implement generic components (running notebooks etc), which seems to limit vendors to nvidia and amd (search for def set_gpu_limit in above link).

@typhoonzero
Copy link
Contributor Author

Please make sure that Kubeflow Pipelines actually accepts custom values. Elyra uses ContainerOp to implement generic components (running notebooks etc), which seems to limit vendors to nvidia and amd (search for def set_gpu_limit in above link).

Thanks for reminding, I'm using ContainerOp's member add_resource_limit to add any resource types, see:

https://github.com/elyra-ai/elyra/pull/2973/files#diff-8baadb91039b47b451a9ecb6a166cbcddbe75bb825776ab4eac9d7c9e6f9170cR274-R275

Copy link
Member

@akchinSTC akchinSTC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @typhoonzero, thanks for the contribution.
Things looking good, looking for more opinions from other members. For me, im wondering if there is a way to put these new parameters under a "click to show" type of button since these fields are starting to take up a lot of real estate.
drawing

"minimum": 0
},
"gpu_memory_vendor": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the gpu_memory_vendor param exclusive to virtualized GPU resources? If so, can we change the name to something like vgpu.....

@typhoonzero
Copy link
Contributor Author

Moved to #3029

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:pipeline-runtime issues related to pipeline runtimes e.g. kubeflow pipelines inactive:obsolete
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants