Commit d4985d4

Upgrade ray version; shrink worker resource allocation
1 parent 5b980da commit d4985d4

3 files changed, with 25 additions and 28 deletions

applications/rag/README.md

Lines changed: 7 additions & 5 deletions
@@ -1,6 +1,6 @@
 # RAG-on-GKE Application
 
-**NOTE:** This solution is in beta/a work in progress - please expect friction while using it.
+**NOTE:** This solution is in beta. Please expect friction while using it.
 
 This is a sample to deploy a RAG application on GKE. Retrieval Augmented Generation (RAG) is a popular approach for boosting the accuracy of LLM responses, particularly for domain specific or private data sets. The basic idea is to have a semantically searchable knowledge base (often using vector search), which is used to retrieve relevant snippets for a given prompt to provide additional context to the LLM. Augmenting the knowledge base with additional data is typically cheaper than fine tuning and is more scalable when incorporating current events and other rapidly changing data spaces.

@@ -32,7 +32,7 @@ CLUSTER_REGION=us-central1
 ```
 2. Use the following instructions to create a GKE cluster. We recommend using Autopilot for a simpler setup.
 
-##### Autopilot
+##### Autopilot (recommended)
 
 RAG requires the latest Autopilot features, available on GKE cluster version `1.29.1-gke.1575000`+
 ```
@@ -46,7 +46,7 @@ gcloud container clusters create-auto ${CLUSTER_NAME:?} \
 --cluster-version ${CLUSTER_VERSION:?}
 ```
 
-##### Standard (recommended)
+##### Standard
 
 1. To create a GKE Standard cluster using Terraform, follow the [instructions here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/infrastructure/README.md). Use the preconfigured node pools in `/infrastructure/platform.tfvars` as this solution requires T4s and L4s.

@@ -105,6 +105,7 @@ gcloud container clusters get-credentials ${CLUSTER_NAME:?} --location ${CLUSTER
 ```
 kubectl port-forward -n ${NAMESPACE:?} deployment/mistral-7b-instruct 8080:8080
 ```
+
 * In a new terminal, try a few prompts:
 ```
 export USER_PROMPT="How to deploy a container on K8s?"
@@ -119,6 +120,7 @@ curl 127.0.0.1:8080/generate -X POST \
 }
 EOF
 ```
+
 * At the end of the smoke test with the TGI server, stop port forwarding by using Ctrl-C on the original terminal.
 
 5. Verify the frontend chat interface is setup:
@@ -167,8 +169,8 @@ This step generates the vector embeddings for your input dataset. Currently, the
 * `os.environ['KAGGLE_KEY']`
 
 9. Run all the cells in the notebook. This will generate vector embeddings for the input dataset (`denizbilginn/google-maps-restaurant-reviews`) and store them in the `pgvector-instance` via a Ray job.
-* Once submitted, Ray will take several minutes to create the runtime environment and optionally scale up Ray worker nodes. During this time, the job status will remain PENDING.
-* When the job status is SUCCEEDED, the vector embeddings have been generated and we are ready to launch the frontend chat interface.
+* If the Ray job has FAILED, re-run the cell.
+* When the Ray job has SUCCEEDED, we are ready to launch the frontend chat interface.
 
 ### Launch the Frontend Chat Interface

applications/rag/example_notebooks/rag-kaggle-ray-sql-latest.ipynb

Lines changed: 2 additions & 3 deletions
@@ -252,7 +252,7 @@
 "id": "7ba6c3ff-a25a-4f4d-b58e-68f7fe7d33df",
 "metadata": {},
 "outputs": [],
-"source": [
+"source": [
 "job_id = client.submit_job(\n",
 " entrypoint=\"python test.py\",\n",
 " # Path to the local directory that contains the entrypoint file.\n",
@@ -278,10 +278,9 @@
 " status = client.get_job_status(job_id)\n",
 " if status != prev_status:\n",
 " print(\"Job status:\", status)\n",
+" print(\"Job info:\", client.get_job_info(job_id).message)\n",
 " prev_status = status\n",
 " if status.is_terminal():\n",
-" if status == 'FAILED':\n",
-" print(\"Job info:\", client.get_job_info(job_id))\n",
 " break\n",
 " time.sleep(5)\n"
 ]
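
The hunk above moves the "Job info" print out of the FAILED-only branch, so the job message is now logged on every status change. A minimal standalone sketch of that polling pattern follows. `JobStatus` and `FakeClient` are hypothetical stand-ins written for this illustration, not Ray's own classes; only the method names `get_job_status`, `get_job_info`, `is_terminal`, and the `.message` attribute are taken from the diff.

```python
import time
from enum import Enum
from types import SimpleNamespace


class JobStatus(Enum):
    """Hypothetical stand-in for the status values named in the README."""
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"

    def is_terminal(self):
        return self in (JobStatus.SUCCEEDED, JobStatus.FAILED)


class FakeClient:
    """Hypothetical stub that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = list(statuses)
        self._current = self._statuses[0]

    def get_job_status(self, job_id):
        # Advance through the scripted statuses, then stay on the last one.
        if self._statuses:
            self._current = self._statuses.pop(0)
        return self._current

    def get_job_info(self, job_id):
        return SimpleNamespace(message=f"job {job_id} is {self._current.value}")


def wait_for_job(client, job_id, poll_seconds=5):
    """Poll until the job is terminal, logging status and message on each change."""
    prev_status = None
    transitions = []
    while True:
        status = client.get_job_status(job_id)
        if status != prev_status:
            print("Job status:", status.value)
            print("Job info:", client.get_job_info(job_id).message)
            transitions.append(status)
            prev_status = status
        if status.is_terminal():
            break
        time.sleep(poll_seconds)
    return transitions


# Usage with the stub; poll_seconds=0 keeps the demo instant.
demo = FakeClient(
    [JobStatus.PENDING, JobStatus.PENDING, JobStatus.RUNNING, JobStatus.SUCCEEDED]
)
history = wait_for_job(demo, "demo-job", poll_seconds=0)
```

Returning the list of transitions is a choice made for this sketch so the loop can be checked without parsing printed output; the notebook cell itself only prints.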

modules/kuberay-cluster/kuberay-autopilot-values.yaml

Lines changed: 16 additions & 20 deletions
@@ -1,4 +1,4 @@
-# Copyright 2023 Google LLC
+# Copyright 2024 Google LLC
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -22,7 +22,7 @@
 image:
 # Replace this with your own image if needed.
 repository: rayproject/ray
-tag: 2.6.1-py310-gpu
+tag: 2.7.1-py310-gpu
 pullPolicy: IfNotPresent
 
 nameOverride: "kuberay"
@@ -64,8 +64,6 @@ head:
 # containerEnv specifies environment variables for the Ray container,
 # Follows standard K8s container env schema.
 containerEnv:
-# - name: EXAMPLE_ENV
-# value: "1"
 - name: RAY_memory_monitor_refresh_ms
 value: "0"
 - name: RAY_GRAFANA_IFRAME_HOST
@@ -90,18 +88,18 @@ head:
 # for further guidance.
 resources:
 limits:
-cpu: "8"
+cpu: "1"
 # To avoid out-of-memory issues, never allocate less than 2G memory for the Ray head.
-memory: "20G"
+memory: "8G"
 ephemeral-storage: 20Gi
 requests:
-cpu: "8"
-memory: "20G"
+cpu: "1"
+memory: "8G"
 ephemeral-storage: 20Gi
 annotations:
 gke-gcsfuse/volumes: "true"
-gke-gcsfuse/cpu-limit: "2"
-gke-gcsfuse/memory-limit: 20Gi
+gke-gcsfuse/cpu-limit: "1"
+gke-gcsfuse/memory-limit: 4Gi
 gke-gcsfuse/ephemeral-storage-limit: 20Gi
 nodeSelector:
 cloud.google.com/compute-class: "Performance"
@@ -158,8 +156,6 @@ worker:
 disabled: true
 
 # The map's key is used as the groupName.
-# For example, key:small-group in the map below
-# will be used as the groupName
 additionalWorkerGroups:
 cpuGroup:
 # Disabled by default
@@ -194,16 +190,16 @@ additionalWorkerGroups:
 resources:
 limits:
 cpu: 4
-memory: "20G"
+memory: "16G"
 ephemeral-storage: 20Gi
 requests:
 cpu: 4
-memory: "20G"
+memory: "16G"
 ephemeral-storage: 20Gi
 annotations:
 gke-gcsfuse/volumes: "true"
 gke-gcsfuse/cpu-limit: "2"
-gke-gcsfuse/memory-limit: 20Gi
+gke-gcsfuse/memory-limit: 8Gi
 gke-gcsfuse/ephemeral-storage-limit: 20Gi
 nodeSelector:
 cloud.google.com/compute-class: "Performance"
@@ -287,19 +283,19 @@ additionalWorkerGroups:
 # for further guidance.
 resources:
 limits:
-cpu: "8"
+cpu: "4"
 nvidia.com/gpu: "2"
-memory: "40G"
+memory: "16G"
 ephemeral-storage: 20Gi
 requests:
-cpu: "8"
+cpu: "4"
 nvidia.com/gpu: "2"
-memory: "40G"
+memory: "16G"
 ephemeral-storage: 20Gi
 annotations:
 gke-gcsfuse/volumes: "true"
 gke-gcsfuse/cpu-limit: "2"
-gke-gcsfuse/memory-limit: 20Gi
+gke-gcsfuse/memory-limit: 8Gi
 gke-gcsfuse/ephemeral-storage-limit: 20Gi
 nodeSelector:
 cloud.google.com/compute-class: "Accelerator"
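
To make the "shrink worker resource allocation" part of this commit easier to read at a glance, the sketch below tabulates the per-pod CPU and memory values changed in the three `resources` hunks of this file and computes the reduction for each. The group labels (`head`, `cpu_worker`, `gpu_worker`) are descriptive names assumed for this illustration: the diff shows `head:` and `cpuGroup:` explicitly, while the third block is inferred to be a GPU worker group from its `nvidia.com/gpu` and `"Accelerator"` entries. The gcsfuse sidecar limits are left out.

```python
# (old, new) values copied from the resources hunks above.
# Group labels are assumptions made for readability, not keys from the file.
CHANGES = {
    "head": {"cpu": (8, 1), "memory_g": (20, 8)},
    "cpu_worker": {"cpu": (4, 4), "memory_g": (20, 16)},
    "gpu_worker": {"cpu": (8, 4), "memory_g": (40, 16)},
}


def reductions_per_pod(changes):
    """Return old minus new for each resource in each group."""
    return {
        group: {res: old - new for res, (old, new) in values.items()}
        for group, values in changes.items()
    }


reductions = reductions_per_pod(CHANGES)
for group, delta in reductions.items():
    print(f"{group}: -{delta['cpu']} CPU, -{delta['memory_g']}G memory per pod")
```

These are per-pod figures only. The total saving for a cluster depends on how many replicas of each group are running, which this diff does not show.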
