[GPU] Updated Kops GPU Setup Hook #4971

Merged: 2 commits into kubernetes:master on Jul 21, 2018

Conversation

@dcwangmit01 (Contributor) commented Apr 11, 2018

Docker image for testing located at: dcwangmit01/aws-nvidia-bootstrap:0.1.1

Compatible with instructions here: https://github.com/kubernetes/kops/blob/master/docs/gpu.md

===

  • Changed the Dockerfile base image to debian for systemctl and bash.
  • Added autodetection of the AWS EC2 instance class (p2, p3, g3); see the sketch after this list.
  • For each detected instance class, added installation of the proper driver
    for the specific NVIDIA hardware:
    • G3 instance types require NVIDIA GRID Series / GRID K520 drivers
    • P2 instance types require NVIDIA Tesla K-Series drivers
    • P3 instance types require NVIDIA Tesla V-Series drivers
  • Set custom nvidia-smi configurations for the NVIDIA hardware in each EC2
    instance class, following the AWS GPU optimization documentation.
  • Added installation and patching of the latest CUDA 9.1 libraries.
  • Added a restart of the kubelet on the kube node at the end of a successful
    hook run, fixing a race condition where the kubelet would start before the
    NVIDIA drivers were loaded and Kubernetes would therefore not detect GPUs
    on the node.
  • Ensured the NVIDIA driver build uses the same gcc version that built the
    default kops kernel.
  • Fixed an issue where every run of this container would download all of the
    NVIDIA drivers + CUDA libs (1GB+), by caching the files on the kube node.
  • Fixed an issue where, after a reboot, subsequent runs of this script would
    fail because mknod would try to re-create an existing device node. This
    previously caused a download loop as systemd perpetually restarted the
    unit upon failure.
  • Tested with p2.xlarge, p3.2xlarge, and g3.4xlarge.
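
For readers skimming the hook, the detection-plus-caching flow described above boils down to something like the following minimal sketch. This is not the PR's actual code: the GRID/M60 driver URL is a placeholder, and the cache path mirrors the /rootfs/nvidia-bootstrap-cache location that shows up later in this thread.

#!/bin/bash
set -euo pipefail

# Detect the EC2 instance class (p2, p3, g3) from the instance metadata service.
INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-type)
INSTANCE_CLASS=${INSTANCE_TYPE%%.*}

# Map the instance class to a driver package. The Tesla URL is the one
# referenced later in this thread; the GRID URL is a placeholder.
case "${INSTANCE_CLASS}" in
  p2|p3)
    DRIVER_URL="http://us.download.nvidia.com/tesla/390.46/NVIDIA-Linux-x86_64-390.46.run" ;;
  g3)
    DRIVER_URL="<grid-k520-or-tesla-m60-driver-url>" ;;
  *)
    echo "Instance class ${INSTANCE_CLASS} has no GPU mapping; nothing to do."
    exit 0 ;;
esac

# Cache the (1GB+) downloads on the node so re-runs and reboots don't re-fetch.
CACHE_DIR=/rootfs/nvidia-bootstrap-cache
mkdir -p "${CACHE_DIR}"
FILE="${CACHE_DIR}/$(basename "${DRIVER_URL}")"
[ -f "${FILE}" ] || curl -fsSL -o "${FILE}" "${DRIVER_URL}"

# ... driver install, nvidia-smi tuning, and a kubelet restart would follow here.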

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 11, 2018
@dcwangmit01 dcwangmit01 changed the title Updated Kops GPU Setup Hook [GPU] Updated Kops GPU Setup Hook Apr 11, 2018
@dcwangmit01 (Contributor Author):

/assign @KashifSaadat

    chroot ${ROOTFS_DIR} $filepath_host --accept-eula --silent
    touch $filepath_installed  # Mark successful installation
else
    echo "Unable to handle file $filepath_host"

Should this exit 1?

Contributor Author:

Fixed. New docker image tag at: dcwangmit01/aws-nvidia-bootstrap:0.1.1
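
Presumably the fix turns the unhandled-file branch into a hard failure; a minimal sketch of that pattern, reusing the variable names from the excerpt above (the *.run test is illustrative, not the PR's exact condition):

if [[ $filepath_host == *.run ]]; then
    chroot ${ROOTFS_DIR} $filepath_host --accept-eula --silent
    touch $filepath_installed  # Mark successful installation
else
    echo "Unable to handle file $filepath_host"
    exit 1  # Fail the hook so the error is surfaced instead of silently skipped
fi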

echo "Installing file $filename on host"
if [[ $download =~ .*NVIDIA.* ]]; then
    # Install the nvidia package (using gcc-7)
    chroot ${ROOTFS_DIR} /bin/bash -c "CC=/usr/bin/gcc-7 $filepath_host --accept-license --silent"
Contributor:

Are these options (with the default directory) compatible with device plugins?

Contributor:

Also, for the device plugins, this is the Google approach (an alternative to the NVIDIA one) to prepare the node.

Contributor Author:

This patch does not take the deviceplugin approach, but it also does not preclude it. It does not install nvidia-docker, and it does not swap the default container runtime. Those could be added on top if we wanted to implement the deviceplugins over the continuation of the existing method. That could be a pull request on top.

The linked PR is interesting because it installs drivers via a daemonset. If one didn't mind running containers on Kubernetes in privileged mode, it would be an interesting alternative to kops hooks. It's also nice because one could deploy via helm chart rather than editing kops instancegroup manifests.

I did take a look at the Google setup script, hoping to ditch what I just wrote in this PR. Unfortunately, just like this current PR, the setup instructions are cloud specific.

Contributor:

The issue is that the accelerators approach is already deprecated.

Contributor Author:

That is a fair point, and a fine issue to address in a follow-on PR. The changes here are still dependencies to make deviceplugins work. Consider it a solid step in that direction. Hopefully someone can take it the last mile, in a logical follow-on PR.

Also, note that without these changes GPUs on AWS P3 and G3 instances don't work with kops today on 1.8, 1.9, or 1.10 (released 2 weeks ago).

The question we have to ask ourselves is whether this PR moves the ball toward the goal line. If not, then no worries.

@bhack (Contributor) commented Apr 12, 2018:

Yes, for now the only concern is whether the NVIDIA installer needs some particular directory setup passed as a parameter that could help device plugin support. If you check, I think the Google solution tries to pass suitable options to the NVIDIA installer.

@bhack (Contributor) commented Apr 26, 2018:

/cc @RenaudWasTaken Any feedback?

@RenaudWasTaken:
Also adding @flx42, will comment as soon as I have some time :)

@rrtaylor:
I'm testing this new setup using P2 instances with HorizontalPodAutoscaler and cluster-autoscaler to test dynamically scaling GPU nodes. I'm seeing that when a new instance is initialized, the container I'm using to run a GPU process (tensorflow serving) starts before the setup hook finishes running and does not use the GPU (it does not fail, it just uses CPUs). Is there a way to stop pods from running until the setup hook finishes? Or should this problem be handled via adding to the container CMD a script that waits or fails until the GPU is available?

@dcwangmit01 (Contributor Author):

@richardbrks Thanks for testing out the PR. I hope it works for you.

Regarding your question:

Is there a way to stop pods from running until the setup hook finishes?

Yes, there is a way. Be sure you are setting a gpu-limit in your pod spec.

    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1

Do a "kubectl describe nodes" and look under Capacity for any node. If things are set up correctly, nodes without a GPU, and GPU nodes where the hook has not yet finished running, will show the following capacity:

alpha.kubernetes.io/nvidia-gpu: 0

At the end of the hook run the kubelet is restarted. Only at this point does the Capacity get updated to:

alpha.kubernetes.io/nvidia-gpu: 1

At this point, and only if you have set the nvidia-gpu limit in your pod specification, pods tagged with that limit will start running on the node.

I actually had the inverse problem of non-gpu pods running on the GPU machines. This was easily taken care of by taints and tolerations.
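
A short sketch of the two checks described above (the node name in the taint command is hypothetical):

# See whether each node's kubelet is advertising GPU capacity yet
# (0 until the hook finishes and restarts the kubelet, then 1):
kubectl describe nodes | grep 'alpha.kubernetes.io/nvidia-gpu'

# Keep non-GPU pods off GPU nodes with a taint; GPU pods then need a matching
# toleration in their spec:
kubectl taint nodes ip-10-0-0-42.us-west-2.compute.internal dedicated=gpu:NoSchedule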

@rrtaylor:
@dcwangmit01 that worked! Thanks for your help (and for this PR)!

# AWS Instance Types to Nvidia Card Mapping (cut and pasted from AWS docs)
# Load the correct driver for the correct instance type
# Instances Product Type Product Series Product
# G2 GRID GRID Series GRID K520 <-- I think they meant G3

This is correct, but I think G2 is a deprecated instance type. It no longer appears on the pricing page, though we have our own G2 instances running.

According to https://aws.amazon.com/ec2/instance-types/ and https://aws.amazon.com/ec2/instance-types/g3/, G3 instances are based on Tesla M60s.

Contributor Author:

@115100 Thanks for the clarifications.

So, based on what you said:

  1. We don't have to worry about G2 instances because one cannot spin them up because of deprecation (am I wrong?)

  2. The driver package for the G3 instance is suboptimal but working (I tested it). I'm looking at the Nvidia website right now and Tesla M60 hardware resolves to the following driver, which matches that of P2 and P3.

http://us.download.nvidia.com/tesla/390.46/NVIDIA-Linux-x86_64-390.46.run

I can make the doc change and driver swap later after more feedback, and if we think this PR has any chance of merging.


G2 instances can still be launched but Nvidia's own drivers don't support kernels >=4.9 - I had to write my own patch to get the driver to work. Personally, I think it's safe to ignore G2.

On G3, I don't have any instances up to test them but the rest of the PR looks good to me.


Nvidia's own drivers don't support kernels >=4.9

That's not true. You didn't pick the right driver package then.


Ah, I see a new set of drivers. It was a while back and it did take a while to support the newer kernel though.

@rohitagarwal003 (Member):

alpha.kubernetes.io/nvidia-gpu

Please don't use this resource anymore. It was deprecated in 1.10 (kubernetes/kubernetes#57384) and is getting removed in 1.11 (kubernetes/kubernetes#61498).

Use device plugins that introduce nvidia.com/gpu as the resource. See the documentation: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

I actually had the inverse problem of non-gpu pods running on the GPU machines. This was easily taken care of by taints and tolerations.

Yes. Taints and tolerations are the right approach for this. If you use device plugins and nvidia.com/gpu, you can use the ExtendedResourceToleration admission controller.
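
For context, a hedged sketch of what that looks like in practice (the pod name is hypothetical; the admission plugin is configured on the API server, which with kops means the cluster spec's kubeAPIServer settings):

# kube-apiserver (1.10+) flag that enables the plugin:
#   --enable-admission-plugins=...,ExtendedResourceToleration
# Once enabled, any pod requesting an extended resource such as nvidia.com/gpu
# gets a matching toleration injected automatically. To inspect what was added:
kubectl get pod gpu-pod -o jsonpath='{.spec.tolerations}'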

@chrislovecnm (Contributor) left a comment:

A couple of questions

@@ -12,9 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-FROM alpine:3.6
+FROM debian:jessie
Contributor:

Why the switch? We should probably use the base k8s container ... see protokube

Contributor:

Never mind, I see why you switched. We should probably use the same container as protokube.

Contributor Author:

Must have or should have? The difference is I've already got a lot of mileage on the existing base image.

Member:

We're switching away from alpine generally, AIUI, so +1 to debian.

@chrislovecnm (Contributor):

/ok-to-test

/assign @rdrgmnzs @mikesplain

Can I get a review by another bash hacker?

Any comments from anyone else??

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 12, 2018
@mikesplain (Contributor) left a comment:

From the bash perspective this looks like some good improvements. Good organization.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2018
@bhack (Contributor) commented May 16, 2018:

Could we suggest to enable ExtendedResourceToleration in gpu.md?

@justinsb justinsb added this to the 1.10 milestone Jun 2, 2018
@KashifSaadat (Contributor):

Great work, thanks for the contribution @dcwangmit01! The comments are also very helpful in understanding the flow.

I'd suggest amending the related GPU documentation to cover your findings/quirks and the deprecation of alpha.kubernetes.io/nvidia-gpu from k8s v1.10, and also updating the example Pod spec, as the referenced tensorflow image no longer seems to be available.

Otherwise this LGTM 👍

@bhack (Contributor) commented Jun 28, 2018:

It seems that the repository matrix already supports Debian distributions. What do you think?

@justinsb (Member):

This LGTM, but...

I'm a little confused by @bhack 's comments. Should we merge this @dcwangmit01 , or should we switch to use the container-engine-accelerators? Or both :-) ?

@dcwangmit01 (Contributor Author) commented Jul 20, 2018:

@justinsb There's nothing in this PR that precludes or is incompatible with device plugins, which are required for kubernetes >= 1.11.0 (which kops does not officially support yet). What's in this PR is better than what currently exists.

I'd like to see it merged, of course. It's not a big deal. I'm working on the device plugin version as we speak, and it uses the same code. I'll have a PR in the coming weeks. The question is: do we want to help people who are still using accelerators until they are forced to upgrade to device plugins in 1.11? I'd say yes.

@faheem-nadeem:
I agree with the above comments to merge the PR. I have been using it for some time to set up GPU instances in our k8s 1.9 cluster for ML workloads. Everything checks out nicely :)

@bhack (Contributor) commented Jul 20, 2018:

I think if it is ready, it is better than the current kops GPU status. Then we can adopt NVIDIA container-engine-accelerators in another PR, as Kubespray is trying to do in kubernetes-sigs/kubespray#2913.

@justinsb (Member):

Seems there is consensus to merge :-) Thanks for clarifying, and thank you for the PR @dcwangmit01

/approve
/lgtm

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcwangmit01, justinsb, mikesplain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 20, 2018
@bhack (Contributor) commented Jul 20, 2018:

Can we just add a reference to ExtendedResourceToleration as discussed in this thread?

@mikesplain (Contributor):

vpc limits

/retest

@justinsb (Member):

I just cleared out the cruft that was causing us to hit the VPC limits, so this should be getting better...

/retest

@justinsb (Member):

/retest

@dcwangmit01 (Contributor Author):

Thanks for monitoring the tests @justinsb and @mikesplain. I've been watching as well.

What's the process for updating the docker image in the kopeio repository? Is it human, or the build system? I haven't updated the readme because the image is sitting in my public dockerhub. It could be re-tagged and pushed, with the doc subsequently updated.

@bhack The usage of ExtendedResourceToleration is an optimization/advanced usage that isn't needed to get GPUs working. I'll leave it be. Feel free to follow up with a PR.
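
For reference, re-tagging and pushing the image is straightforward; a sketch assuming a hypothetical kopeio/aws-nvidia-bootstrap target repository and push access to it:

docker pull dcwangmit01/aws-nvidia-bootstrap:0.1.1
docker tag dcwangmit01/aws-nvidia-bootstrap:0.1.1 kopeio/aws-nvidia-bootstrap:0.1.1
docker push kopeio/aws-nvidia-bootstrap:0.1.1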

@k8s-ci-robot k8s-ci-robot merged commit 19b81f0 into kubernetes:master Jul 21, 2018
@erez-rabih commented Jul 23, 2018:

Hi,
I tried taking the script and running it on a k8s 1.7.5 cluster using kops 1.7.1.
The installation failed because of a mismatch between the gcc version used to build the kernel and the one used to compile the NVIDIA drivers.
This is the syslog: syslog.txt
This is the nvidia-installer log: nvidia-installer.txt

Any idea how to fix that?

@dcwangmit01 (Contributor Author) commented Jul 23, 2018:

Hi @erez-rabih,

You are presumably using a kops OS image where the kernel is built with the same gcc version that the distribution installs by default; that is how things normally should be. However, the default kops images that I've seen in the stable channel manifest all have their kernels built with GCC 7.3.0, whilst the default OS gcc packages are GCC 4.9.2. The installation scripts assume a default kops image with a kernel compiled with GCC 7.3.0, as specified in the stable channel manifest; thus gcc-7 is force-installed and the NVIDIA drivers are forced to build with gcc-7.

This hook will not work anywhere the kernel has not been compiled with GCC 7.3.0. Perhaps you are using an older OS image build. It will also not work on Debian stretch images, where the kernel and the default gcc are both gcc-6. Try upgrading to one of the current stable images in the stable channel manifest: do a kops edit cluster, set the image, and then do a rolling update.

This morning I spun up a few different kops images from the stable channel manifest to check kernel and gcc versions. I've pasted the output below. You should choose from one of the images.

-dave

# Jessie image where the kernel is compiled with a different gcc version than the default (7.3.0 != 4.9.2)
image: k8s-1.7-debian-jessie-amd64-hvm-ebs-2018-03-11
$ cat /proc/version
Linux version 4.4.121-k8s (root@65861083f005) (gcc version 7.3.0 (Debian 7.3.0-10) ) #1 SMP Sun Mar 11 19:39:47 UTC 2018
$ gcc --version
gcc (Debian 4.9.2-10+deb8u1) 4.9.2

# Jessie image where the kernel is compiled with a different gcc version than the default (7.3.0 != 4.9.2)
image: k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
$ cat /proc/version
Linux version 4.4.121-k8s (root@65861083f005) (gcc version 7.3.0 (Debian 7.3.0-10) ) #1 SMP Sun Mar 11 19:39:47 UTC 2018
$ gcc --version
gcc (Debian 4.9.2-10+deb8u1) 4.9.2

# Jessie image where the kernel is compiled with a different gcc version than the default (7.3.0 != 4.9.2)
image: k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-03-11
$ cat /proc/version
Linux version 4.4.121-k8s (root@65861083f005) (gcc version 7.3.0 (Debian 7.3.0-10) ) #1 SMP Sun Mar 11 19:39:47 UTC 2018
$ gcc --version
gcc (Debian 4.9.2-10+deb8u1) 4.9.2

# Stretch image where the kernel is compiled with the same gcc version as the default (6.3.0 == 6.3.0)
image: k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27
$ cat /proc/version
Linux version 4.9.0-6-amd64 ([email protected]) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07)
$ gcc --version
gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
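
A quick way to check a node for this mismatch before running the hook (standard commands, nothing kops-specific):

# Compare the gcc that built the running kernel with the image's default gcc.
# If they differ (as on the jessie images above), the driver build must be
# forced to a matching compiler, which is why the hook uses CC=/usr/bin/gcc-7.
grep -o 'gcc version [0-9.]*' /proc/version
gcc --version | head -n1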

@erez-rabih:
@dcwangmit01 Thanks for your detailed answer; unfortunately it didn't help.
I changed the image to kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2018-03-11 as suggested:

$ cat /proc/version
Linux version 4.4.121-k8s (root@65861083f005) (gcc version 7.3.0 (Debian 7.3.0-10) ) #1 SMP Sun Mar 11 19:39:47 UTC 2018
$ gcc --version
gcc (Debian 4.9.2-10+deb8u1) 4.9.2

and the hook is set to run the image dcwangmit01/aws-nvidia-bootstrap:0.1.1
The instance type is p2.xlarge on us-west-2
Still I'm getting this error:

Verifying sha1sum of file at /rootfs/nvidia-bootstrap-cache/NVIDIA-Linux-x86_64-390.46.run
/rootfs/nvidia-bootstrap-cache/NVIDIA-Linux-x86_64-390.46.run: OK
Installing file NVIDIA-Linux-x86_64-390.46.run on host
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 390.46........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most
       frequently when this kernel module was built against the wrong or
       improperly configured kernel sources, with a version of gcc that
       differs from the one used to build the target kernel, or if a driver
       such as rivafb, nvidiafb, or nouveau is present and prevents the
       NVIDIA kernel module from obtaining ownership of the NVIDIA graphics
       device(s), or no NVIDIA GPU installed in this system is supported by
       this NVIDIA Linux graphics driver release.
       
       Please see the log entries 'Kernel module load error' and 'Kernel
       messages' at the end of the file '/var/log/nvidia-installer.log' for
       more information.


ERROR: Installation has failed.  Please see the file
       '/var/log/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available
       on the Linux driver download page at www.nvidia.com.

and on nvidia-installer log:

    LD [M]  /tmp/selfgz22/NVIDIA-Linux-x86_64-390.46/kernel/nvidia.ko
   make[1]: Leaving directory '/usr/src/linux-headers-4.4.121-k8s'
-> done.
-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: Exec format error
-> Kernel messages:
[   24.253863] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[  159.891274] [drm] Module unloaded
[  160.029583] wmi: Mapper unloaded
[  208.822809] ipmi message handler version 39.2
[  208.832189] nvidia: loading out-of-tree module taints kernel.
[  208.832196] nvidia: module license 'NVIDIA' taints kernel.
[  208.832198] Disabling lock debugging due to kernel taint
[  208.838213] module: nvidia: Unknown rela relocation: 4
[  249.828118] Netfilter messages via NETLINK v0.30.
[  249.835848] ctnetlink v0.93: registering with nfnetlink.
[  264.576925] ipmi message handler version 39.2
[  264.592577] module: nvidia: Unknown rela relocation: 4
[  268.951956] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[  268.978813] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  268.984780] device vethdf46a165 entered promiscuous mode
[  268.988535] cni0: port 1(vethdf46a165) entered forwarding state
[  268.992624] cni0: port 1(vethdf46a165) entered forwarding state
[  269.457955] IPv6: eth0: IPv6 duplicate address fe80::580f:e1ff:febe:c8e7 detected!
[  284.001947] cni0: port 1(vethdf46a165) entered forwarding state
[  316.677382] ipmi message handler version 39.2
[  316.692481] module: nvidia: Unknown rela relocation: 4
[  375.059282] ipmi message handler version 39.2
[  375.074478] module: nvidia: Unknown rela relocation: 4
[  433.346601] ipmi message handler version 39.2
[  433.361618] module: nvidia: Unknown rela relocation: 4
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

@dcwangmit01 (Contributor Author):

@erez-rabih Try the 1.10 jessie image. That's the one I've used.

Also try the new PR in legacy mode here: #5502

@erez-rabih:
@dcwangmit01 there is no jessie 1.10 image.
The latest jessie image on the latest channel is kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11.

@dcwangmit01 (Contributor Author):

@erez-rabih Let's move this conversation into an issue, if it needs to continue. Here's how to find images.

$ aws ec2 describe-images --owners 383156758163| grep ImageLocation|grep 1.10-debian-jessie|grep 5-27
            "ImageLocation": "383156758163/k8s-1.10-debian-jessie-amd64-hvm-ebs-2018-05-27",
