Cherry-pick #635 to release-1.1 branch #637

Merged
36 changes: 15 additions & 21 deletions applications/rag/README.md
@@ -31,20 +31,17 @@ Install the following on your computer:

### Bring your own cluster (optional)

By default, this tutorial creates a Standard cluster on your behalf. We highly recommend following the default settings.
By default, this tutorial creates a cluster on your behalf. We highly recommend following the default settings.

If you prefer to manage your own cluster, set `create_cluster = false` in the [Installation section](#installation). Creating a long-running cluster may be better for development, allowing you to iterate on Terraform components without recreating the cluster every time.

Use the provided infrastructure module to create a cluster:

1. `cd ai-on-gke/infrastructure`

2. Edit `platform.tfvars` to set your project ID, location and cluster name. The other fields are optional. Ensure you create an L4 nodepool as this tutorial requires it.

3. Run `terraform init`

4. Run `terraform apply --var-file workloads.tfvars`
Use gcloud to create a GKE Autopilot cluster. Note that RAG requires the latest Autopilot features, available on the latest versions of 1.28 and 1.29.

```
gcloud container clusters create-auto rag-cluster \
--location us-central1 \
--cluster-version 1.28
```
### Bring your own VPC (optional)

By default, this tutorial creates a new network on your behalf with [Private Service Connect](https://cloud.google.com/vpc/docs/private-service-connect) already enabled. We highly recommend following the default settings.
@@ -64,10 +61,11 @@ This section sets up the RAG infrastructure in your GCP project using Terraform.
1. `cd ai-on-gke/applications/rag`

2. Edit `workloads.tfvars` to set your project ID, location, cluster name, and GCS bucket name. Ensure the `gcs_bucket` name is globally unique (add a random suffix). Optionally, make the following changes:
* (Optional) Set a custom `kubernetes_namespace` where all k8s resources will be created.
* (Recommended) [Enable authenticated access](#configure-authenticated-access-via-iap) for JupyterHub, frontend chat and Ray dashboard services.
* (Not recommended) Set `create_cluster = false` if you bring your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled.
* (Not recommended) Set `create_network = false` if you bring your own VPC. Ensure your VPC has Private Service Connect enabled as described above.
* (Optional) Set a custom `kubernetes_namespace` where all k8s resources will be created.
* (Optional) Set `autopilot_cluster = false` to deploy using GKE Standard.
* (Optional) Set `create_cluster = false` if you are bringing your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled. You can simplify setup by following the Terraform instructions in [`infrastructure/README.md`](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/infrastructure/README.md).
* (Optional) Set `create_network = false` if you are bringing your own VPC. Ensure your VPC has Private Service Connect enabled as described above.

3. Run `terraform init`
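
Step 2 asks for a globally unique `gcs_bucket` name with a random suffix. A minimal sketch of one way to generate such a name follows; the helper name `unique_bucket_name` and the `rag-data` prefix are hypothetical and not part of this repository.

```shell
# Print "<prefix>-<8 hex chars>", using 4 random bytes from /dev/urandom.
# The helper name and the prefix used below are illustrative assumptions.
unique_bucket_name() {
  prefix="$1"
  # od prints the bytes as hex pairs; tr strips the spaces and newline.
  suffix="$(od -An -N4 -tx1 /dev/urandom | tr -d ' \n')"
  printf '%s-%s\n' "$prefix" "$suffix"
}
```

Calling `unique_bucket_name rag-data` prints a name you can paste into the `gcs_bucket` field of `workloads.tfvars`. The suffix makes a collision unlikely but does not guarantee the name is free.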

@@ -193,17 +191,13 @@ Connect to the GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}
```

1. Troubleshoot JupyterHub job failures:
- If the JupyterHub job fails to start the proxy with error code 599, it is likely a known issue with Cloud DNS, which occurs when a cluster is quickly deleted and recreated with the same name.
- Recreate the cluster with a different name or wait several minutes after running `terraform destroy` before running `terraform apply`.

2. Troubleshoot Ray job failures:
1. Troubleshoot Ray job failures:
- If the Ray actors fail to be scheduled, it could be due to a stockout or quota issue.
- Run `kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=kuberay`. There should be a Ray head and Ray worker pod in `Running` state. If your Ray pods aren't running, it's likely due to quota or stockout issues. Check that your project and selected `cluster_location` have L4 GPU capacity.
- Often, retrying the Ray job submission (the last cell of the notebook) helps.
- The Ray job may take 15-20 minutes to run the first time due to environment setup.

3. Troubleshoot IAP login issues:
2. Troubleshoot IAP login issues:
- Verify the cert is Active:
- For JupyterHub: `kubectl get managedcertificates jupyter-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'`
- For the frontend: `kubectl get managedcertificates frontend-managed-cert -n rag --output jsonpath='{.status.domainStatus[0].status}'`
@@ -213,14 +207,14 @@ gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_L
- Org error:
- The [OAuth Consent Screen](https://developers.google.com/workspace/guides/configure-oauth-consent#configure_oauth_consent) has `User type` set to `Internal` by default, which means principals external to the org your project is in cannot log in. To add external principals, change `User type` to `External`.

4. Troubleshoot `terraform apply` failures:
3. Troubleshoot `terraform apply` failures:
- Inference server (`mistral`) fails to deploy:
- This usually indicates a stockout/quota issue. Verify your project and chosen `cluster_location` have L4 capacity.
- GCS bucket already exists:
- GCS bucket names must be globally unique; pick a different name with a random suffix.
- Cloud SQL instance already exists:
- Ensure the `cloudsql_instance` name doesn't already exist in your project.

5. Troubleshoot `terraform destroy` failures:
4. Troubleshoot `terraform destroy` failures:
- Network deletion issue:
- `terraform destroy` fails to delete the network due to a known issue in the GCP provider. For now, the workaround is to manually delete it.
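
The Ray and IAP checks in the troubleshooting list above can be wrapped in small shell helpers. The following is a sketch only: the function names `ray_pods_not_running` and `wait_for_cert`, and the retry defaults, are hypothetical. The `kubectl` selectors and the `jsonpath` expression are the ones quoted in the README.

```shell
# List kuberay pods that are not in the Running phase.
# Uses the same label selector as the Ray troubleshooting step.
ray_pods_not_running() {
  ray_ns="$1"
  kubectl get pods -n "$ray_ns" \
    -l app.kubernetes.io/name=kuberay \
    --field-selector='status.phase!=Running'
}

# Poll a ManagedCertificate until its first domain reports Active.
# attempts and delay_seconds are illustrative defaults, not project values.
wait_for_cert() {
  cert_name="$1"
  cert_ns="$2"
  attempts="${3:-30}"
  delay_seconds="${4:-20}"
  try=0
  while [ "$try" -lt "$attempts" ]; do
    # A failed kubectl call is treated as "status not known yet".
    cert_status="$(kubectl get managedcertificates "$cert_name" -n "$cert_ns" \
      --output jsonpath='{.status.domainStatus[0].status}' 2>/dev/null || true)"
    if [ "$cert_status" = "Active" ]; then
      echo "${cert_name}: Active"
      return 0
    fi
    echo "${cert_name}: ${cert_status:-unknown}, retrying in ${delay_seconds}s"
    try=$((try + 1))
    sleep "$delay_seconds"
  done
  return 1
}
```

For example, `wait_for_cert jupyter-managed-cert "${NAMESPACE}"` returns 0 once the certificate is Active and 1 if it never gets there within the retry budget, so it can gate a login attempt in a script.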
4 changes: 2 additions & 2 deletions applications/rag/metadata.yaml
@@ -28,8 +28,8 @@ spec:
varType: string
defaultValue: "created-by=gke-ai-quick-start-solutions,ai.gke.io=rag"
- name: autopilot_cluster
varType: string
defaultValue: false
varType: bool
defaultValue: true
- name: iap_consent_info
description: Configure the <a href="https://developers.google.com/workspace/guides/configure-oauth-consent#configure_oauth_consent"><i>OAuth Consent Screen</i></a> for your project. Ensure <b>User type</b> is set to <i>Internal</i>. Note that by default, only users within your organization can be allowlisted. To add external users, change the <b>User type</b> to <i>External</i> after the application is deployed.
varType: bool
2 changes: 1 addition & 1 deletion applications/rag/variables.tf
@@ -319,7 +319,7 @@ variable "private_cluster" {

variable "autopilot_cluster" {
type = bool
default = false
default = true
}

variable "cloudsql_instance" {
2 changes: 1 addition & 1 deletion applications/rag/workloads.tfvars
@@ -20,7 +20,7 @@ subnetwork_cidr = "10.100.0.0/16"
create_cluster = true # Creates a GKE cluster in the specified network.
cluster_name = "<cluster-name>"
cluster_location = "us-central1"
autopilot_cluster = false
autopilot_cluster = true
private_cluster = false

## GKE environment variables
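
With this change `autopilot_cluster` defaults to `true`, so anyone who wants GKE Standard has to set `autopilot_cluster = false` in `workloads.tfvars`. As a rough sketch, that edit could be scripted as below. The helper name `set_tfvar` is hypothetical, and the approach assumes simple `key = value` lines with values that contain no `|` or `&` characters.

```shell
# Rewrite a top-level "key = value" line in a tfvars file.
# Any trailing comment on the matched line is dropped.
# Hypothetical helper, not part of the repository.
set_tfvar() {
  tfvars_file="$1"
  key="$2"
  value="$3"
  tmp_file="$(mktemp)"
  # Write to a temp file first so a failed sed leaves the original untouched.
  if sed "s|^${key}[[:space:]]*=.*|${key} = ${value}|" "$tfvars_file" > "$tmp_file"; then
    cat "$tmp_file" > "$tfvars_file"
  fi
  rm -f "$tmp_file"
}
```

Running `set_tfvar workloads.tfvars autopilot_cluster false` before `terraform apply` would switch the deployment back to GKE Standard. Editing the file by hand is equivalent.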