Commit 83edc3e

Author: Robert Koehlmoos
Merge pull request #66 from GPS-Solutions/main
syncing aica to css
2 parents efb5002 + 65233ec, commit 83edc3e

File tree: 3 files changed, +35 −4 lines

components/llm_service/src/config/models.json (+1 −1)

@@ -153,7 +153,7 @@
       "temperature": 0.2,
       "top_p": 0.95,
       "top_k": 40,
-      "max_length": 2048
+      "max_tokens": 2048
     }
   },
   "VertexAI-ModelGarden-LLAMA2-Chat": {
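The rename from `max_length` to `max_tokens` can be sanity-checked in isolation. A minimal sketch, assuming only the parameter block visible in this hunk (the temporary file path is illustrative, not part of the repo):

```shell
# Recreate the parameter block from this hunk and confirm the rename
# from "max_length" to "max_tokens" parses as expected.
cat > /tmp/models_fragment.json <<'EOF'
{
  "parameters": {
    "temperature": 0.2,
    "top_p": 0.95,
    "top_k": 40,
    "max_tokens": 2048
  }
}
EOF
python3 -c "import json; p = json.load(open('/tmp/models_fragment.json'))['parameters']; assert 'max_length' not in p; print(p['max_tokens'])"  # prints 2048
```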

experimental/vllm_gemma/README.md (+31)

@@ -1,7 +1,38 @@
 # Deploying Gemma 2B
+Reference: https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-vllm
 
 ## Pre-Requisites
 Kubernetes cluster with L4 GPUs nodepool
+```shell
+export CLUSTER_NAME="main-cluster"
+export REGION="us-central1"
+gcloud container node-pools create gpu-node-pool \
+  --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
+  --project=${PROJECT_ID} \
+  --location=${REGION} \
+  --node-locations=${REGION}-a \
+  --cluster=${CLUSTER_NAME} \
+  --service-account gke-sa@${PROJECT_ID}.iam.gserviceaccount.com \
+  --machine-type=g2-standard-24 \
+  --disk-type pd-balanced \
+  --disk-size 100 \
+  --num-nodes=1
+
+gcloud container node-pools list --region=${REGION} --cluster=${CLUSTER_NAME}
+```
+
+
+## HuggingFace API Token
+```shell
+export HF_TOKEN=...
+```
+Create secret:
+```shell
+kubectl create secret generic hf-secret \
+  --from-literal=hf_api_token=$HF_TOKEN \
+  --dry-run=client -o yaml | kubectl apply -f -
+kubectl describe secret hf-secret
+```
 
 ## Deployment
 Deploy Gemma 2B LLM using `kubectl`
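Once the node pool from the README's pre-requisites is created, one way to confirm the L4 nodes registered is to filter on the accelerator label GKE applies to GPU nodes. A hedged sketch, assuming `kubectl` is already pointed at the cluster (not shown in this diff):

```shell
# List nodes carrying the GKE accelerator label for L4 GPUs.
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-l4 -o wide

# Print each matching node's advertised GPU capacity.
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-l4 \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'
```

With `count=2` per node in the `gcloud` command above, each listed node should report a GPU capacity of 2.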

experimental/vllm_gemma/vllm-gemma-2b-it.yaml (+3 −3)

@@ -11,13 +11,13 @@ spec:
   metadata:
     labels:
       app: gemma-server
-      ai.gke.io/model: gemma-2b-it
+      ai.gke.io/model: gemma-1.1-2b-it
       ai.gke.io/inference-server: vllm
       examples.ai.gke.io/source: user-guide
   spec:
     containers:
     - name: inference-server
-      image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01
+      image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240527_0916_RC00
       resources:
         requests:
           cpu: "2"
@@ -61,5 +61,5 @@ spec:
   type: ClusterIP
   ports:
   - protocol: TCP
-    port: 8000
+    port: 80
     targetPort: 8000
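After this change the Service listens on port 80 while still targeting the container's 8000, so a local smoke test can go through a port-forward. A sketch under assumptions: the Service name `llm-service` is not visible in this hunk and is taken from the referenced GKE tutorial, and the request uses vLLM's `/generate` endpoint:

```shell
# Forward local 8080 to the Service's new port 80 (which targets container port 8000).
kubectl port-forward service/llm-service 8080:80 &
sleep 2

# Send a test prompt to the vLLM server.
curl -s -X POST http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Why is the sky blue?", "temperature": 0.9, "max_tokens": 64}'
```

Note that in-cluster clients now reach the Service on plain port 80; only `targetPort` still refers to 8000 inside the pod.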
