
AWS: TensorFlow Serving: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. #727

Closed
@karlschriek

Description


When following this guide, https://www.kubeflow.org/docs/components/tfserving_new/, I am unable to serve a model using ks param set ${MODEL_COMPONENT} numGpus 1. Doing so results in the error 0/1 nodes are available: 1 Insufficient nvidia.com/gpu., which presumably means that the nvidia.com/gpu device plugin has not been deployed. I am at a loss as to how exactly this should be done: documentation on the NVIDIA website is quite scant, and the GPU example linked from the guide (https://github.com/kubeflow/examples/blob/master/object_detection/tf_serving_gpu.md) offers no explanation whatsoever.

As a side note, if I leave out ks param set ${MODEL_COMPONENT} numGpus 1 (or set numGpus to 0), it also doesn't work, resulting in:

Error: failed to start container "testmodel": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"setenv: invalid argument\"": unknown

EDIT

The solution to this is as follows:

  1. When creating the cluster, a nodeGroup with a GPU instance type such as p3.2xlarge must be created. eksctl will then automatically create instances using the "EKS Optimized with GPU" AMI, as described here: https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html

For example:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mycluster
  region: us-east-1
  version: '1.12'
availabilityZones: ["us-east-1a", "us-east-1b"]

nodeGroups:
  - name: cpu-nodegroup
    instanceType: m5.2xlarge
    desiredCapacity: 1
    minSize: 0
    maxSize: 2
    volumeSize: 30
  - name: gpu-nodegroup
    instanceType: p3.2xlarge # GPU instance type; eksctl selects the "EKS Optimized with GPU" AMI for it automatically
    desiredCapacity: 1
    minSize: 0
    maxSize: 10
    volumeSize: 50
    availabilityZones: ["us-east-1a"] # pin to a zone where p3 instances are available
    iam:
      withAddonPolicies:
        autoScaler: true # attach the IAM policies the cluster autoscaler needs
    labels:
      'k8s.amazonaws.com/accelerator': 'nvidia-tesla-v100' # lets the cluster autoscaler identify the GPU type on this group
  2. Thereafter, the nvidia-device-plugin DaemonSet must be deployed, as follows (a quick verification follows below):

    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml

I think it is really necessary that the guide describe these requirements.
