# Online inference with GPUs on Google Cloud

This reference architecture implements online inferencing using GPUs on Google
Cloud. It builds on top of the
[Inference Platform reference implementation](/platforms/gke/base/use-cases/inference-ref-arch/terraform/README.md).

## Best practices for online inferencing on Google Cloud

### Accelerator selection

### Storage solution selection

### Model selection

### Observability

### Scalability

### Cost optimization

## Architecture

## Deploy the reference architecture

This reference architecture builds on top of the infrastructure that the
[Inference Platform reference implementation](/platforms/gke/base/use-cases/inference-ref-arch/terraform/README.md)
provides, and follows the best practices that the reference implementation
establishes.

Before you deploy the reference architecture described in this document,
deploy one instance of the Inference Platform reference implementation. You
can deploy multiple instances of the reference architecture in the same
project. To deploy the reference architecture, do the following:

1. To enable deploying resources for the online inference reference
   architecture, initialize the following configuration variables in
   `platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/inference-ref-arch.auto.tfvars`:

   ```hcl
   ira_use_case_flavor = "ira-online-gpu"
   ```
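
   If you prefer to set the variable from the command line, the following
   sketch appends it to the file from the repository root. It assumes the
   variable isn't already defined in the file:

   ```shell
   # Append the use case flavor to the shared Terraform configuration.
   echo 'ira_use_case_flavor = "ira-online-gpu"' \
     >> platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/inference-ref-arch.auto.tfvars
   ```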

1. Deploy an instance of the Inference Platform reference implementation. For
   more information about how to deploy an instance, see
   [Inference Platform reference implementation](/platforms/gke/base/use-cases/inference-ref-arch/terraform/README.md).

   After you deploy the reference implementation instance, continue following
   this document.

## Download the model to Cloud Storage

1. Take note of the name of the Cloud Storage bucket where the model will be
   downloaded:

   ```shell
   terraform -chdir="${ACP_PLATFORM_USE_CASE_DIR}/terraform/cloud_storage" init \
     && terraform -chdir="${ACP_PLATFORM_USE_CASE_DIR}/terraform/cloud_storage" output -json ira_google_storage_bucket_names
   ```

   The output might contain multiple bucket names. The name of the bucket
   where the model will be downloaded ends with the `ira-model` suffix.
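
   If you have the `jq` CLI installed, you can extract that bucket name
   directly. This sketch assumes the Terraform output is a JSON list of
   bucket name strings:

   ```shell
   # Print only the bucket whose name ends with the ira-model suffix.
   terraform -chdir="${ACP_PLATFORM_USE_CASE_DIR}/terraform/cloud_storage" output -json ira_google_storage_bucket_names \
     | jq -r '.[] | select(endswith("ira-model"))'
   ```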

1. Initialize the configuration variables in
   `platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/model-download.env`
   to set the name of the Cloud Storage bucket where the model will be
   downloaded (a filled-in example follows the variable descriptions):

   ```shell
   IRA_BUCKET_NAME=<IRA_BUCKET_NAME>
   MODEL_ID=<MODEL_ID>
   ```

   Where:

   - `<IRA_BUCKET_NAME>` is the name of the Cloud Storage bucket where the
     model will be downloaded.
   - `<MODEL_ID>` is the fully qualified model identifier:

     - For Gemma, the fully qualified model identifier is:
       `google/gemma-3-27b-it`
     - For Llama 4, the fully qualified model identifier is:
       `meta-llama/Llama-4-Scout-17B-16E-Instruct`
     - For Llama 3.3, the fully qualified model identifier is:
       `meta-llama/Llama-3.3-70B-Instruct`
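
   For example, a completed `model-download.env` for Gemma might look like
   the following. The bucket name `example-project-ira-model` is
   hypothetical; use the name you noted in the previous step:

   ```shell
   IRA_BUCKET_NAME=example-project-ira-model
   MODEL_ID=google/gemma-3-27b-it
   ```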

1. [Generate a Hugging Face token](https://huggingface.co/docs/hub/security-tokens).
   Make sure to grant the
   `Read access to contents of all public gated repos you can access`
   permission to the Hugging Face token.
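
   Optionally, you can check that the token is valid before using it. This
   sketch calls the Hugging Face Hub `whoami-v2` API endpoint and assumes
   the token is in the `HUGGING_FACE_TOKEN` environment variable:

   ```shell
   # A valid token returns a JSON document that describes your account.
   curl --silent --fail \
     --header "Authorization: Bearer ${HUGGING_FACE_TOKEN}" \
     https://huggingface.co/api/whoami-v2
   ```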

1. Store the Hugging Face token in
   `platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/hugging-face-token.env`:

   ```shell
   HUGGING_FACE_TOKEN=<HUGGING_FACE_TOKEN>
   ```

   Where:

   - `<HUGGING_FACE_TOKEN>` is the Hugging Face token.

   If the
   `platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/hugging-face-token.env`
   file doesn't exist, create it.
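
   One way to create the file without leaving the token in your shell
   history is to read it from an interactive prompt. A sketch, run from the
   repository root:

   ```shell
   # Prompt for the token without echoing it, then write the env file.
   read -r -s -p "Hugging Face token: " HUGGING_FACE_TOKEN && echo
   printf 'HUGGING_FACE_TOKEN=%s\n' "${HUGGING_FACE_TOKEN}" \
     > platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/hugging-face-token.env
   ```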

1. Get access to the model by signing the consent agreement:

   - For Gemma:

     1. Access the
        [model consent page on Kaggle.com](https://www.kaggle.com/models/google/gemma).

     1. Verify consent using your Hugging Face account.

     1. Accept the model terms.

   - For Llama:

     1. Accept the model terms on Hugging Face.

1. Deploy the model downloader in the GKE cluster:

   ```shell
   kubectl apply -k platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download
   ```

1. Wait for the model downloader to download the model:

   ```shell
   watch --color --interval 5 --no-title \
     "kubectl get job/transfer-model-to-gcs | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e 'Complete'"
   ```

   The output is similar to the following:

   ```text
   NAME                    STATUS     COMPLETIONS   DURATION   AGE
   transfer-model-to-gcs   Complete   1/1           33m        3h30m
   ```
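
   If the job doesn't reach the `Complete` status, you can inspect the
   downloader logs:

   ```shell
   kubectl logs job/transfer-model-to-gcs --follow
   ```

   After the job completes, you can optionally confirm that the model files
   are in the bucket. This sketch assumes the downloader stores the model
   under an object path that matches the model identifier; substitute the
   values from `model-download.env`:

   ```shell
   gcloud storage ls --recursive "gs://<IRA_BUCKET_NAME>/<MODEL_ID>"
   ```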

### Roles and permissions

### Next steps

### Destroy the reference architecture