Cherry-pick #635 to release-1.1 branch (#637)

Robert Bailey · artemvmin · web-flow · commit e7b191a313fc · 2024-04-30T21:30:40.000Z
Update RAG to use Autopilot by default (#635)

Remove DNS troubleshooting information, as this has been patched.

Co-authored-by: artemvmin <artemvmin@google.com>
diff --git a/applications/rag/README.md b/applications/rag/README.md
@@ -31,20 +31,17 @@ Install the following on your computer:
 
 ### Bring your own cluster (optional)
 
-By default, this tutorial creates a Standard cluster on your behalf. We highly recommend following the default settings.
+By default, this tutorial creates a cluster on your behalf. We highly recommend following the default settings.
 
 If you prefer to manage your own cluster, set `create_cluster = false` in the [Installation section](#installation). Creating a long-running cluster may be better for development, allowing you to iterate on Terraform components without recreating the cluster every time.
 
-Use the provided infrastructue module to create a cluster:
-
-1. `cd ai-on-gke/infrastructure`
-
-2. Edit `platform.tfvars` to set your project ID, location and cluster name. The other fields are optional. Ensure you create an L4 nodepool as this tutorial requires it.
-
-3. Run `terraform init`
-
-4. Run `terraform apply --var-file workloads.tfvars`
+Use gcloud to create a GKE Autopilot cluster. Note that RAG requires the latest Autopilot features, available on the latest versions of 1.28 and 1.29.
 
+```
+gcloud container clusters create-auto rag-cluster \
+  --location us-central1 \
+  --cluster-version 1.28
+```
 ### Bring your own VPC (optional)
 
 By default, this tutorial creates a new network on your behalf with [Private Service Connect](https://cloud.google.com/vpc/docs/private-service-connect) already enabled. We highly recommend following the default settings.
@@ -64,10 +61,11 @@ This section sets up the RAG infrastructure in your GCP project using Terraform.
 1. `cd ai-on-gke/applications/rag`
 
 2. Edit `workloads.tfvars` to set your project ID, location, cluster name, and GCS bucket name. Ensure the `gcs_bucket` name is globally unique (add a random suffix). Optionally, make the following changes:
-    * (Optional) Set a custom `kubernetes_namespace` where all k8s resources will be created.
     * (Recommended) [Enable authenticated access](#configure-authenticated-access-via-iap) for JupyterHub, frontend chat and Ray dashboard services.
-    * (Not recommended) Set `create_cluster = false` if you bring your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled.
-    * (Not recommended) Set `create_network = false` if you bring your own VPC. Ensure your VPC has Private Service Connect enabled as described above.
+    * (Optional) Set a custom `kubernetes_namespace` where all k8s resources will be created.
+    * (Optional) Set `autopilot_cluster = false` to deploy using GKE Standard.
+    * (Optional) Set `create_cluster = false` if you are bringing your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled. You can simplify setup by following the Terraform instructions in [`infrastructure/README.md`](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/infrastructure/README.md).
+    * (Optional) Set `create_network = false` if you are bringing your own VPC. Ensure your VPC has Private Service Connect enabled as described above.
 
 3. Run `terraform init`
 
@@ -193,17 +191,13 @@ Connect to the GKE cluster:
 gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}
 ```
 
-1. Troubleshoot JupyterHub job failures:
-    - If the JupyterHub job fails to start the proxy with error code 599, it is likely an known issue with Cloud DNS, which occurs when a cluster is quickly deleted and recreated with the same name.
-    - Recreate the cluster with a different name or wait several minutes after running `terraform destroy` before running `terraform apply`.
-
-2. Troubleshoot Ray job failures: 
+1. Troubleshoot Ray job failures:
     - If the Ray actors fail to be scheduled, it could be due to a stockout or quota issue.
         - Run `kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=kuberay`. There should be a Ray head and Ray worker pod in `Running` state. If your ray pods aren't running, it's likely due to quota or stockout issues. Check that your project and selected `cluster_location` have L4 GPU capacity.
     - Often, retrying the Ray job submission (the last cell of the notebook) helps.
     - The Ray job may take 15-20 minutes to run the first time due to environment setup.
 
-3. Troubleshoot IAP login issues:
+2. Troubleshoot IAP login issues:
     - Verify the cert is Active:
         - For JupyterHub `kubectl get managedcertificates jupyter-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'`
         - For the frontend: `kubectl get managedcertificates frontend-managed-cert -n rag --output jsonpath='{.status.domainStatus[0].status}'`
@@ -213,14 +207,14 @@ gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_L
     - Org error:
         - The [OAuth Consent Screen](https://developers.google.com/workspace/guides/configure-oauth-consent#configure_oauth_consent) has `User type` set to `Internal` by default, which means principals external to the org your project is in cannot log in. To add external principals, change `User type` to `External`.
 
-4. Troubleshoot `terraform apply` failures:
+3. Troubleshoot `terraform apply` failures:
     - Inference server (`mistral`) fails to deploy:
         - This usually indicates a stockout/quota issue. Verify your project and chosen `cluster_location` have L4 capacity.
     - GCS bucket already exists:
         - GCS bucket names have to be globally unique, pick a different name with a random suffix.
     - Cloud SQL instance already exists:
         - Ensure the `cloudsql_instance` name doesn't already exist in your project.
 
-5. Troubleshoot `terraform destroy` failures:
+4. Troubleshoot `terraform destroy` failures:
     - Network deletion issue:
         - `terraform destroy` fails to delete the network due to a known issue in the GCP provider. For now, the workaround is to manually delete it.
diff --git a/applications/rag/metadata.yaml b/applications/rag/metadata.yaml
@@ -28,8 +28,8 @@ spec:
         varType: string
         defaultValue: "created-by=gke-ai-quick-start-solutions,ai.gke.io=rag"
       - name: autopilot_cluster
-        varType: string
-        defaultValue: false
+        varType: bool
+        defaultValue: true
       - name: iap_consent_info
         description: Configure the <a href="https://developers.google.com/workspace/guides/configure-oauth-consent#configure_oauth_consent"><i>OAuth Consent Screen</i></a> for your project. Ensure <b>User type</b> is set to <i>Internal</i>. Note that by default, only users within your organization can be allowlisted. To add external users, change the <b>User type</b> to <i>External</i> after the application is deployed.
         varType: bool
diff --git a/applications/rag/variables.tf b/applications/rag/variables.tf
@@ -319,7 +319,7 @@ variable "private_cluster" {
 
 variable "autopilot_cluster" {
   type    = bool
-  default = false
+  default = true
 }
 
 variable "cloudsql_instance" {
diff --git a/applications/rag/workloads.tfvars b/applications/rag/workloads.tfvars
@@ -20,7 +20,7 @@ subnetwork_cidr = "10.100.0.0/16"
 create_cluster    = true # Creates a GKE cluster in the specified network.
 cluster_name      = "<cluster-name>"
 cluster_location  = "us-central1"
-autopilot_cluster = false
+autopilot_cluster = true
 private_cluster   = false
 
 ## GKE environment variables
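
Because this change flips the default of `autopilot_cluster` from `false` to `true`, deployments that relied on the old default now get an Autopilot cluster unless they opt out. A minimal sketch of the `workloads.tfvars` override that keeps GKE Standard, assuming the variable names shown in the hunks above and that the rest of the file is left unchanged; the cluster name is a hypothetical placeholder:

```hcl
# workloads.tfvars (excerpt): opt out of the new Autopilot default.
create_cluster    = true            # let Terraform create the cluster
cluster_name      = "rag-standard"  # hypothetical name, choose your own
cluster_location  = "us-central1"
autopilot_cluster = false           # deploy on GKE Standard instead of Autopilot
private_cluster   = false
```

If `create_cluster = false` is used instead, the updated README notes that a bring-your-own Standard cluster needs an L4 nodepool with autoscaling and node autoprovisioning enabled.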
