[GPU] Updated Kops GPU Setup Hook #4971
Conversation
* Changed the Dockerfile base image to debian for systemctl and bash.
* Added autodetection of the AWS EC2 instance class (p2, p3, g3); see the sketch after this list.
* For each detected instance class, added installation of the proper driver for the specific NVIDIA hardware:
  - G3 instance types require Nvidia Grid Series/GRID K520 drivers
  - P2 instance types require Nvidia Tesla K-Series drivers
  - P3 instance types require Nvidia Tesla V-Series drivers
* Set custom nvidia-smi configurations per EC2 instance class, according to the AWS GPU optimization document.
* Added installation and patching of the latest CUDA 9.1 libraries.
* Added a restart of kubelet on the kube node at the end of a successful hook run, fixing a race condition where kubelet would start before the Nvidia drivers were loaded and therefore Kubernetes would not detect GPUs on the node.
* Ensured the Nvidia driver build uses the same gcc version as the one used to build the default kops kernel.
* Fixed an issue where *every* run of this container would download all of the NVIDIA drivers + CUDA libs (1GB+), by caching the files on the kube node.
* Fixed an issue where, after a reboot, subsequent runs of this script would fail because mknod would try to create a previously-created device node. This previously caused a download loop as systemd perpetually restarted the unit upon failure.
* Tested with p2.xlarge, p3.2xlarge, and g3.4xlarge.
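For readers skimming the conversation, here is a minimal, hypothetical sketch of the instance-class autodetection described above. Variable names and structure are assumptions, and the driver URLs are placeholders rather than real download links.

```bash
#!/bin/bash
# Hypothetical sketch of the autodetection described above; names and URLs are
# placeholders, not the hook's actual values.
set -euo pipefail

# The EC2 metadata service reports the full instance type, e.g. "p2.xlarge".
INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-type)
INSTANCE_CLASS=${INSTANCE_TYPE%%.*}   # "p2", "p3", "g3", ...

case "${INSTANCE_CLASS}" in
  g3) DRIVER_URL="<nvidia-grid-series-driver-url>" ;;    # GRID K520-class driver
  p2) DRIVER_URL="<nvidia-tesla-k-series-driver-url>" ;; # Tesla K-Series driver
  p3) DRIVER_URL="<nvidia-tesla-v-series-driver-url>" ;; # Tesla V-Series driver
  *)
    echo "No NVIDIA driver mapping for instance class ${INSTANCE_CLASS}" >&2
    exit 1
    ;;
esac

echo "Instance ${INSTANCE_TYPE}: would download and install ${DRIVER_URL}"
```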
/assign @KashifSaadat
chroot ${ROOTFS_DIR} $filepath_host --accept-eula --silent
touch $filepath_installed  # Mark successful installation
else
echo "Unable to handle file $filepath_host"
Should this `exit 1`?
Fixed. New docker image tag at: dcwangmit01/aws-nvidia-bootstrap:0.1.1
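For context, a rough sketch of what the fix amounts to; the surrounding if/else structure is assumed from the snippets quoted in this review rather than copied from the final code.

```bash
# Assumed shape of the install dispatch (other branches, e.g. the CUDA
# --accept-eula one, omitted); the "exit 1" in the else branch is the fix.
if [[ $download =~ .*NVIDIA.* ]]; then
  chroot ${ROOTFS_DIR} /bin/bash -c "CC=/usr/bin/gcc-7 $filepath_host --accept-license --silent"
  touch $filepath_installed  # Mark successful installation
else
  echo "Unable to handle file $filepath_host" >&2
  exit 1  # Fail the unit so the error is surfaced instead of being silently skipped
fi
```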
echo "Installing file $filename on host" | ||
if [[ $download =~ .*NVIDIA.* ]]; then | ||
# Install the nvidia package (using gcc-7) | ||
chroot ${ROOTFS_DIR} /bin/bash -c "CC=/usr/bin/gcc-7 $filepath_host --accept-license --silent" |
Are these options (with the default installation directory) compatible with device plugins?
Also, for device plugins, this is the Google approach (an alternative to the Nvidia one) to preparing the node.
This patch does not take the device plugin approach, but it also does not preclude it. It does not install nvidia-docker, and it does not swap the default container runtime; those could be added on top of the existing method if we wanted to implement device plugins, as a follow-on pull request.
The linked PR is interesting because it installs drivers via a DaemonSet. If one didn't mind running privileged containers on Kubernetes, it would be an interesting alternative to kops hooks. It's also nice because one could deploy it via a Helm chart rather than editing kops instancegroup manifests.
I did take a look at the Google setup script, hoping to ditch what I just wrote in this PR. Unfortunately, just like this PR, its setup instructions are cloud-specific.
The issue is that the accelerator resource is already deprecated.
That is a fair point, and a fine issue to address in a follow-on PR. The changes here are still prerequisites for making device plugins work; consider this a solid step in that direction that someone can hopefully take the last mile in a logical follow-on PR.
Also, note that without these changes, GPUs on AWS P3 and G3 instances don't work with kops today on 1.8, 1.9, or 1.10 (released 2 weeks ago).
The question we have to ask ourselves is whether this PR moves the ball toward the goal line. If not, then no worries.
Yes, for now the main thing to check is whether the Nvidia installer needs any particular directory setup passed as a parameter that could help device plugin support. If you look, I think the Google solution tries to pass good options to the Nvidia installer.
/cc @RenaudWasTaken Any feedback?
Also adding @flx42, will comment as soon as I have some time :)
I'm testing this new setup using P2 instances with HorizontalPodAutoscaler and cluster-autoscaler to test dynamically scaling GPU nodes. I'm seeing that when a new instance is initialized, the container I'm using to run a GPU process (TensorFlow Serving) starts before the setup hook finishes running and does not use the GPU (it does not fail, it just uses CPUs). Is there a way to stop pods from running until the setup hook finishes? Or should this problem be handled by adding to the container CMD a script that waits or fails until the GPU is available?
@richardbrks Thanks for testing out the PR. I hope it works for you. Regarding your question:
Yes, there is a way. Be sure you are setting a gpu-limit in your pod spec.
Do a "kubectl describe nodes" and look under Capacity for any node. If you have set things up correctly you should see for nodes that don't have GPU and for nodes that have GPU but has not had the hook finish running, the following capacity.
At the end of the hook run the kubelet is restarted. Only at this point does the node's Capacity get updated to include the GPU resource.
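A quick way to watch for that transition (this assumes the pre-device-plugin accelerator resource name):

```bash
# List nodes and their reported NVIDIA GPU capacity; GPU nodes show no (or zero)
# capacity until the hook finishes and kubelet restarts.
kubectl describe nodes | grep -iE '^Name:|nvidia-gpu'
```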
At that point, only if you have set the nvidia-gpu limit in your pod specification will a pod tagged with such a limit start running on the node. I actually had the inverse problem of non-GPU pods running on the GPU machines; that was easily taken care of with taints and tolerations.
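Putting the two pieces together, a hypothetical pod spec might look like the following. The taint key/value is only an example of the taints-and-tolerations setup mentioned above, and the image is a placeholder.

```bash
# Hypothetical example only: the alpha accelerator resource limit keeps the pod
# Pending until the node reports GPU capacity, and the toleration lets it land
# on a tainted GPU node. Taint key/value and image are assumptions.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:9.1-base        # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
  tolerations:
  - key: dedicated                     # example taint applied to the GPU instancegroup
    operator: Equal
    value: gpu
    effect: NoSchedule
EOF
```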
@dcwangmit01 that worked! Thanks for your help (and for this PR)!
# AWS Instance Types to Nvidia Card Mapping (cut and pasted from AWS docs)
# Load the correct driver for the correct instance type
#   Instances    Product Type    Product Series    Product
#   G2           GRID            GRID Series       GRID K520   <-- I think they meant G3
This is correct, but I think G2 is a deprecated instance type. It no longer appears on the pricing page, though we have our own G2 instances running.
According to https://aws.amazon.com/ec2/instance-types/ and https://aws.amazon.com/ec2/instance-types/g3/, G3 instances are based on Tesla M60s.
@115100 Thanks for the clarifications.
So, based on what you said:
- We don't have to worry about G2 instances because one cannot spin them up anymore due to deprecation (am I wrong?)
- The driver package for the G3 instances is suboptimal but working (I tested it). I'm looking at the Nvidia website right now, and the Tesla M60 hardware resolves to the following driver, which matches that of P2 and P3: http://us.download.nvidia.com/tesla/390.46/NVIDIA-Linux-x86_64-390.46.run
I can make the doc change and driver swap later after more feedback, and if we think this PR has any chance of merging.
G2 instances can still be launched, but Nvidia's own drivers don't support kernels >= 4.9; I had to write my own patch to get the driver to work. Personally, I think it's safe to ignore G2.
On G3, I don't have any instances up to test them but the rest of the PR looks good to me.
> Nvidia's own drivers don't support kernels >= 4.9
That's not true. You didn't pick the right driver package then.
Ah, I see a new set of drivers. It was a while back and it did take a while to support the newer kernel though.
Please don't use this resource anymore. It was deprecated in 1.10 (kubernetes/kubernetes#57384) and is getting removed in 1.11 (kubernetes/kubernetes#61498). Use device plugins, which introduce a replacement resource, instead.
Yes, taints and tolerations are the right approach for this. If you use device plugins and ...
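For comparison, once the NVIDIA device plugin is in use instead of the deprecated accelerator resource, the request side of a pod changes roughly as sketched below (an illustration, not code from this PR).

```bash
# Accelerator era (deprecated in 1.10, removed in 1.11):
#   resources: { limits: { alpha.kubernetes.io/nvidia-gpu: 1 } }
# Device-plugin era (e.g. the NVIDIA k8s-device-plugin DaemonSet):
#   resources: { limits: { nvidia.com/gpu: 1 } }
# Once the plugin registers, the node advertises the new resource:
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```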
A couple of questions
@@ -12,9 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-FROM alpine:3.6
+FROM debian:jessie
Why the switch? We should probably use the base k8s container ... see protokube
Never mind, I see why you switched. We should probably use the same container as protokube.
Must have or should have? The difference is I've already got a lot of mileage on the existing base image.
We're switching away from alpine generally, AIUI, so +1 to debian.
/ok-to-test
/assign @rdrgmnzs @mikesplain
Can I get a review by another bash hacker? Any comments from anyone else?
From the bash perspective these look like some good improvements. Good organization.
/lgtm
Could we suggest enabling ...
Great work, thanks for the contribution @dcwangmit01! The comments are also very helpful in understanding the flow. I'd suggest maybe making some amendments to the related GPU documentation regarding your findings / quirks and the deprecation discussed above. Otherwise this LGTM 👍
Seems that the repository matrix already supports Debian distributions. What do you think?
This LGTM, but... I'm a little confused by @bhack's comments. Should we merge this @dcwangmit01, or should we switch to use the container-engine-accelerators? Or both :-)?
@justinsb There's nothing in this PR that precludes or is incompatible with device plugins, which are required for Kubernetes >= 1.11.0, which kops does not officially support yet. What's in this PR is better than what currently exists. I'd like to see it merged, of course, but it's not a big deal. I'm working on the device plugin version as we speak, and it uses the same code; I'll have a PR in the coming weeks. The question is: do we want to help people that are still using accelerators until they are forced to upgrade to device plugins in 1.11? I'd say yes.
I agree with the above comments to merge the PR. I have been using this PR for some time to set up GPU instances in our k8s 1.9 cluster for ML workloads. Everything checks out nicely :)
I think if it is ready it is better than the current kops GPU status. Then we can introduce Nvidia container-engine-accelerators with another PR, like Kubespray is trying to do at kubernetes-sigs/kubespray#2913.
Seems there is consensus to merge :-) Thanks for clarifying, and thank you for the PR @dcwangmit01
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dcwangmit01, justinsb, mikesplain
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Can we just add a reference to ...
VPC limits
/retest
I just cleared out the cruft that was causing us to hit the VPC limits, so this should be getting better...
/retest
/retest
Thanks for monitoring the tests @justinsb and @mikesplain. I've been watching as well. What's the process for updating the docker image in the kopeio repository? Is it human, or the build system? I haven't updated the readme because the image is sitting in my public dockerhub. It could be re-tagged and pushed, with the doc subsequently updated. @bhack The usage of ...
Hi, any idea how to fix that?
Hi @erez-rabih, you must be using a kops OS image where the kernel is built with the same gcc version that the distribution installs by default. That's how things should be. However, the default kops images that I've seen in the stable channel manifest all have their kernels built with GCC 7.3.0, whilst the distribution's default gcc packages are GCC 4.9.2. The installation scripts assume a default kops image with a kernel compiled with GCC 7.3.0, as specified in the stable channel manifest, and thus gcc-7 is force-installed and the Nvidia drivers are forced to build with gcc-7. This hook will not work anywhere the kernel has not been compiled with GCC 7.3.0. Perhaps you are using an older OS image build. This will also not work on Debian Stretch images, where the kernel and default gcc are both gcc-6. Try upgrading to one of the current stable images in the stable channel manifest: do a kops edit cluster, set the image, and then do a rolling update. This morning I spun up a few different kops images from the stable channel manifest to check kernel and gcc versions; I've pasted the output below. You should choose one of those images. -dave
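If you want to run the same check on your own nodes or images, a small sketch (assuming a Debian-based kops image):

```bash
# Compare the compiler that built the running kernel with the gcc the hook uses.
cat /proc/version              # e.g. "... (gcc version 7.3.0 ...)" on images this hook expects
gcc --version | head -n1       # the distribution's default gcc
ls /usr/bin/gcc-* 2>/dev/null  # shows whether gcc-7 is installed alongside the default
```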
@dcwangmit01 Thanks for your detailed answer; unfortunately, it didn't help.
The hook is set to run the image:
and the nvidia-installer log shows:
@erez-rabih Try the 1.10 jessie image. That's the one I've used. Also try the new PR in legacy mode here: #5502
@dcwangmit01 there is no jessie 1.10
@erez-rabih Let's move this conversation into an issue, if it needs to continue. Here's how to find images.
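One way to do that, sketched under the assumption that the relevant image names are the ones listed in the kops stable channel manifest discussed above:

```bash
# Option 1: read the image names straight out of the kops stable channel manifest.
curl -s https://raw.githubusercontent.com/kubernetes/kops/master/channels/stable | grep -B1 -A3 'images:'

# Option 2: search AWS for a specific image name taken from that manifest
# (the name pattern below is just an example).
aws ec2 describe-images --region us-east-1 \
  --filters "Name=name,Values=k8s-1.10-debian-jessie*" \
  --query 'Images[].Name'
```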
Docker image for testing located at: dcwangmit01/aws-nvidia-bootstrap:0.1.1
Compatible with instructions here: https://github.com/kubernetes/kops/blob/master/docs/gpu.md