
AWS: TensorFlow Serving: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. #727

Closed
@karlschriek

Description


When following this guide, https://www.kubeflow.org/docs/components/tfserving_new/, I am unable to serve a model using ks param set ${MODEL_COMPONENT} numGpus 1. Doing so results in the error 0/1 nodes are available: 1 Insufficient nvidia.com/gpu., which presumably means that the nvidia.com/gpu device plugin has not been deployed. I am at a loss as to how exactly this should be done: documentation on the NVIDIA website is quite scant, and the GPU example linked from the guide (https://github.com/kubeflow/examples/blob/master/object_detection/tf_serving_gpu.md) offers no explanation whatsoever.

As a side note, if I leave out ks param set ${MODEL_COMPONENT} numGpus 1 (or set numGpus to 0), it also doesn't work, resulting in:

Error: failed to start container "testmodel": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"setenv: invalid argument\"": unknown

EDIT

The solution to this is as follows:

  1. When creating the cluster, a nodeGroup with a GPU instance type such as p3.2xlarge must be created. eksctl will then automatically create instances using the "EKS Optimized with GPU" AMI, as described here: https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html

For example:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mycluster
  region: us-east-1
  version: '1.12'
availabilityZones: ["us-east-1a", "us-east-1b"]

nodeGroups:
  - name: cpu-nodegroup
    instanceType: m5.2xlarge
    desiredCapacity: 1
    minSize: 0
    maxSize: 2
    volumeSize: 30
  - name: gpu-nodegroup
    instanceType: p3.2xlarge # GPU instance type; eksctl selects the "EKS Optimized with GPU" AMI for it automatically
    desiredCapacity: 1
    minSize: 0
    maxSize: 10
    volumeSize: 50
    availabilityZones: ["us-east-1a"] # pin to a zone where p3 instances are available
    iam:
      withAddonPolicies:
        autoScaler: true # attach the IAM policies the cluster autoscaler needs
    labels:
      'k8s.amazonaws.com/accelerator': 'nvidia-tesla-v100' # lets the cluster autoscaler identify the GPU type on this group
  2. Thereafter, the nvidia-device-plugin DaemonSet must be deployed, as follows (a quick verification follows below):

    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml

I think it is really necessary that the guide describe these requirements.
