
[TPU Webhook] Fix KubeRay headless worker svc truncation bug #963

Merged

Conversation

ryanaoleary
Collaborator

This PR fixes a bug where the KubeRay RayCluster controller's truncation of generated service names (here) resulted in incorrectly generated TPU_WORKER_HOSTNAMES. The bug was discovered because RayJobs and RayServices with names longer than 13 characters produce long generated RayCluster names, and therefore a truncated TPU headless worker service name. This PR applies the same truncation when generating the name within the webhook so that the two names remain consistent.
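For reference, a minimal Go sketch of the truncation behavior being matched. The 50-character limit and the keep-the-tail behavior are assumptions inferred from the service name shown in the manual test below; truncateServiceName is an illustrative name, not the actual helper in the webhook or the KubeRay operator.

// Minimal sketch of the truncation being mirrored (illustrative only; the
// 50-character limit and tail-keeping behavior are inferred from the output
// below, and truncateServiceName is not the real helper name).
package main

import "fmt"

const maxServiceNameLength = 50 // assumed limit applied by the KubeRay controller

// truncateServiceName keeps only the trailing maxServiceNameLength characters
// of an over-long generated service name.
func truncateServiceName(name string) string {
	if len(name) <= maxServiceNameLength {
		return name
	}
	return name[len(name)-maxServiceNameLength:]
}

func main() {
	svc := "extremely-long-test-raycluster-name" + "-headless-worker-svc"
	fmt.Println(truncateServiceName(svc))
	// Prints: mely-long-test-raycluster-name-headless-worker-svc
}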

Related Issues:
ray-project/kuberay#2923

Testing Process:

  • Unit tests - added unit test coverage (see the illustrative test sketch after this list)
  • Manual tests
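For illustration, a hypothetical table-driven test for the truncation helper sketched above (the real unit tests added by this PR live in the webhook package and use its actual helpers; these names are illustrative).

// Hypothetical test; assumes the truncateServiceName sketch above is in the
// same package.
package main

import "testing"

func TestTruncateServiceName(t *testing.T) {
	cases := []struct {
		desc string
		in   string
		want string
	}{
		{
			desc: "short name is unchanged",
			in:   "short-raycluster-headless-worker-svc",
			want: "short-raycluster-headless-worker-svc",
		},
		{
			desc: "long name keeps only the last 50 characters",
			in:   "extremely-long-test-raycluster-name-headless-worker-svc",
			want: "mely-long-test-raycluster-name-headless-worker-svc",
		},
	}
	for _, c := range cases {
		if got := truncateServiceName(c.in); got != c.want {
			t.Errorf("%s: got %q, want %q", c.desc, got, c.want)
		}
	}
}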

@ryanaoleary
Collaborator Author

Manual testing process:

  • created a RayCluster with the following spec:
# This template contains a KubeRay cluster using a 2x2x2 TPU v4 PodSlice.
# To get access to TPU resources, please follow instructions in this link:
# https://cloud.google.com/kubernetes-engine/docs/how-to/tpus
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  # Label required for TPU webhook to initialize environments.
  labels:
    app.kubernetes.io/name: kuberay
  name: extremely-long-test-raycluster-name
spec:
  headGroupSpec:
    rayStartParams:
      {}
    template:
      spec:
        imagePullSecrets:
          []
        containers:
          - volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
            name: ray-head
            image: rayproject/ray:2.9.0-py310
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "8"
                ephemeral-storage: 20Gi
                memory: 40G
              requests:
                cpu: "8"
                ephemeral-storage: 10Gi
                memory: 40G
            securityContext:
              {}
            env:
              - name: RAY_memory_monitor_refresh_ms
                value: "0"
              - name: RAY_GRAFANA_IFRAME_HOST
                value: http://${grafana_host}
              - name: RAY_GRAFANA_HOST
                value: http://grafana:80
              - name: RAY_PROMETHEUS_HOST
                value: http://frontend:9090
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
        volumes:
          - emptyDir: {}
            name: ray-logs
      metadata:
        labels:
          cloud.google.com/gke-ray-node-type: head
          app.kubernetes.io/name: kuberay
          app.kubernetes.io/instance: example-cluster

  workerGroupSpecs:
  - rayStartParams:
      {}
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 2
    groupName: workergroup
    template:
      spec:
        imagePullSecrets:
          []
        containers:
          - volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
            name: ray-worker
            image: rayproject/ray:2.9.0-py310
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "1"
                ephemeral-storage: 20Gi
                google.com/tpu: "4"
                memory: 40G
              requests:
                cpu: "1"
                ephemeral-storage: 10Gi
                google.com/tpu: "4"
                memory: 40G
            securityContext:
              {}
            env:
            ports:
              null
        volumes:
          - emptyDir: {}
            name: ray-logs
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
          cloud.google.com/gke-tpu-topology: 2x2x2
      metadata:
        labels:
          cloud.google.com/gke-ray-node-type: worker
          app.kubernetes.io/name: kuberay
          app.kubernetes.io/instance: example-cluster

Checked headless service name created for multi-host TPU group:

(myenv) ryanaoleary@ryanaoleary1:~/Desktop/forks/kuberay/ray-operator/config/samples$ kubectl get svc
NAME                                                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                         AGE
extremely-long-test-raycluster-name-head-svc         ClusterIP   None           <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   61s
kuberay-operator                                     ClusterIP   34.118.226.3   <none>        8080/TCP                                        7d2h
kubernetes                                           ClusterIP   34.118.224.1   <none>        443/TCP                                         26d
mely-long-test-raycluster-name-headless-worker-svc   ClusterIP   None           <none>        <none>                                          61s

Checked created Pods:

(myenv) ryanaoleary@ryanaoleary1:~/Desktop/forks/kuberay/ray-operator/config/samples$ kubectl get pods
NAME                                                           READY   STATUS    RESTARTS   AGE
extremely-long-test-raycluster-name-head-7jcqq                 0/1     Pending   0          47s
extremely-long-test-raycluster-name-workergroup-worker-cr5vh   0/1     Pending   0          46s
extremely-long-test-raycluster-name-workergroup-worker-kh7fs   0/1     Pending   0          47s
kuberay-operator-77d8944bfc-dfssn                              1/1     Running   0          5d8h

Verified TPU_WORKER_HOSTNAMES matches service name:

kubectl describe pods extremely-long-test-raycluster-name-workergroup-worker-kh7fs
...
      TPU_WORKER_HOSTNAMES:                 workergroup-0-0.mely-long-test-raycluster-name-headless-worker-svc,workergroup-0-1.mely-long-test-raycluster-name-headless-worker-svc
      TPU_WORKER_ID:                        0
      TPU_NAME:                             workergroup-0
...
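For context, a sketch of the naming pattern visible above: each hostname is <groupName>-<replicaIndex>-<hostIndex> joined to the (truncated) headless worker service name. This illustrates the observed format; it is not the webhook's actual code.

// Illustrative only: reconstructs TPU_WORKER_HOSTNAMES from the naming
// pattern seen in the pod description above.
package main

import (
	"fmt"
	"strings"
)

// workerHostnames builds one hostname per TPU host in a replica, joined with
// commas, following the <group>-<replica>-<host>.<headless-svc> pattern.
func workerHostnames(groupName string, replicaIndex, numOfHosts int, headlessSvc string) string {
	hostnames := make([]string, 0, numOfHosts)
	for host := 0; host < numOfHosts; host++ {
		hostnames = append(hostnames, fmt.Sprintf("%s-%d-%d.%s", groupName, replicaIndex, host, headlessSvc))
	}
	return strings.Join(hostnames, ",")
}

func main() {
	fmt.Println(workerHostnames("workergroup", 0, 2,
		"mely-long-test-raycluster-name-headless-worker-svc"))
	// workergroup-0-0.mely-...-svc,workergroup-0-1.mely-...-svc
}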

ryanaoleary enabled auto-merge (squash) February 6, 2025 01:14
@ryanaoleary
Collaborator Author

cc: @andrewsykim

Signed-off-by: Ryan O'Leary <[email protected]>
ryanaoleary merged commit c44682d into GoogleCloudPlatform:main Feb 6, 2025
7 checks passed
ArthurKamalov pushed a commit to volatilemolotov/ai-on-gke that referenced this pull request Feb 18, 2025