RAG: Database setup and initialization

This Kubernetes job loads the Flipkart product catalog into the AlloyDB database named product_catalog. It also creates separate columns to store the text, image, and multimodal embeddings in a table named clothes in the product_catalog database.

Prerequisites

Preparation

  • Clone the repository.

    git clone https://github.com/GoogleCloudPlatform/accelerated-platforms && \
    cd accelerated-platforms
  • Change directory to the guide directory.

    cd use-cases/rag-pipeline/alloy-db-setup
  • Ensure that your MLP_ENVIRONMENT_FILE is configured.

    cat ${MLP_ENVIRONMENT_FILE} && \
    set -o allexport && \
    source ${MLP_ENVIRONMENT_FILE} && \
    set +o allexport

    You should see the various variables populated with the information specific to your environment.

  • Get credentials for the GKE cluster.

    gcloud container clusters get-credentials ${MLP_CLUSTER_NAME} \
    --dns-endpoint \
    --project=${MLP_PROJECT_ID} \
    --region=${MLP_REGION}
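
The environment check above relies on reading the `cat` output by eye. As a minimal sketch, the same check can be scripted. The helper name `require_vars` is hypothetical and not part of the repository, and the variable names in the example comment are only the ones this guide references. It assumes the variables are exported, which is what sourcing the file under `set -o allexport` does.

```shell
# Hypothetical helper (not part of the repository): report which of the
# named environment variables are unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what sourcing the
    # environment file under `set -o allexport` produces.
    if [ -z "$(printenv "${name}" || true)" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example:
#   require_vars MLP_PROJECT_ID MLP_REGION MLP_CLUSTER_NAME MLP_KUBERNETES_NAMESPACE
```

The function returns a non-zero status if any variable is missing, so it can gate the later steps in a script.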

Build the container image

  • Build the container image using Cloud Build and push the image to Artifact Registry.

    cd src
    git restore cloudbuild.yaml
    sed -i -e "s|^serviceAccount:.*|serviceAccount: projects/${MLP_PROJECT_ID}/serviceAccounts/${MLP_BUILD_GSA}|" cloudbuild.yaml
    gcloud beta builds submit \
    --config cloudbuild.yaml \
    --gcs-source-staging-dir gs://${MLP_CLOUDBUILD_BUCKET}/source \
    --project ${MLP_PROJECT_ID} \
    --region ${MLP_REGION} \
    --substitutions _DESTINATION=${MLP_DB_SETUP_IMAGE}
    cd -

    It takes approximately 2 minutes for the build to complete.
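
Once the build finishes, it can be useful to confirm that the image is in Artifact Registry before creating the jobs. The sketch below wraps `gcloud artifacts docker images describe` in a hypothetical helper named `image_exists`. It assumes that `MLP_DB_SETUP_IMAGE` holds a full Artifact Registry image reference, which this guide implies but does not state.

```shell
# Hypothetical helper (not part of the repository): succeed only if the
# given image reference can be described in Artifact Registry.
image_exists() {
  image="$1"
  # Standard output is discarded; errors (for example, authentication or
  # "not found") are left visible on standard error.
  gcloud artifacts docker images describe "${image}" >/dev/null
}

# Example (assumes MLP_DB_SETUP_IMAGE is a full image reference):
#   image_exists "${MLP_DB_SETUP_IMAGE}" && echo "image found"
```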

Run the job

  • Configure the job.

    set -o nounset
    export CATALOG_DB_NAME="product_catalog"
    export CATALOG_TABLE_NAME="clothes"
    export DB_READ_USERS="${MLP_DB_USER_IAM}"
    export DB_WRITE_USERS="${MLP_DB_USER_IAM}"
    export EMBEDDING_COLUMN_IMAGE="image_embeddings"
    export EMBEDDING_COLUMN_MULTIMODAL="multimodal_embeddings"
    export EMBEDDING_COLUMN_TEXT="text_embeddings"
    export EMBEDDING_DIMENSION="768"
    export EMBEDDING_ENDPOINT_IMAGE="http://multimodal-embedding-model.ml-team:80/image_embeddings"
    export EMBEDDING_ENDPOINT_MULTIMODAL="http://multimodal-embedding-model.ml-team:80/multimodal_embeddings"
    export EMBEDDING_ENDPOINT_TEXT="http://multimodal-embedding-model.ml-team:80/text_embeddings"
    export MASTER_CATALOG_FILE_NAME="RAG/master_product_catalog.csv"
    export NUM_LEAVES_VALUE="300"
    set +o nounset

    Ensure there are no "bash: <ENVIRONMENT_VARIABLE>: unbound variable" error messages, then render the job manifests.

    git restore manifests/job-initialize-database.yaml manifests/job-populate-table.yaml
    envsubst < manifests/job-initialize-database.yaml | sponge manifests/job-initialize-database.yaml
    envsubst < manifests/job-populate-table.yaml | sponge manifests/job-populate-table.yaml
  • Create the initialize database job.

    kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} apply -f manifests/job-initialize-database.yaml

    It takes approximately 1 minute for the job to complete.

  • Watch the job until it is complete.

    watch --color --interval 5 --no-title \
    "kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} get job/initialize-database | GREP_COLORS='mt=01;92' grep -E --color=always -e '^' -e 'Complete'
    echo '\nLogs(last 10 lines):'
    kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} logs job/initialize-database --tail 10"

    When the job is complete, the output is similar to the following:

    NAME                  STATUS     COMPLETIONS   DURATION   AGE
    initialize-database   Complete   1/1           XXXXX      XXXXX
    
  • Check logs for any errors.

    kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} logs job/initialize-database
  • Create the populate table job.

    kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} apply -f manifests/job-populate-table.yaml

    It takes approximately 12 minutes for the job to complete.

  • Watch the job until it is complete.

    watch --color --interval 5 --no-title \
    "kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} get job/populate-table | GREP_COLORS='mt=01;92' grep -E --color=always -e '^' -e 'Complete'
    echo '\nLogs(last 10 lines):'
    kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} logs job/populate-table --tail 10"

    When the job is complete, the output is similar to the following:

    NAME             STATUS     COMPLETIONS   DURATION   AGE
    populate-table   Complete   1/1           XXXXX      XXXXX
    
  • Check logs for any errors.

    kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} logs job/populate-table
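
The `watch` commands above are interactive. For scripted or unattended runs, a non-interactive alternative is `kubectl wait`. The following is a minimal sketch, not the guide's own method: the helper name `wait_for_job` is hypothetical, and the timeout values in the example comment are assumptions chosen to sit comfortably above the approximate durations quoted in the steps.

```shell
# Hypothetical helper (not part of the repository): block until a Job
# reports the Complete condition, printing recent logs if it does not.
wait_for_job() {
  job_name="$1"
  timeout="$2"   # for example 300s; an assumed value, tune as needed
  if kubectl --namespace "${MLP_KUBERNETES_NAMESPACE}" wait \
      --for=condition=complete "job/${job_name}" --timeout="${timeout}"; then
    echo "job/${job_name} completed"
  else
    echo "job/${job_name} did not complete within ${timeout}" >&2
    # Show the last log lines to help with troubleshooting.
    kubectl --namespace "${MLP_KUBERNETES_NAMESPACE}" logs "job/${job_name}" --tail 10 >&2 || true
    return 1
  fi
}

# Example (timeouts are assumptions):
#   wait_for_job initialize-database 300s
#   wait_for_job populate-table 1800s
```

Note that waiting on `condition=complete` does not return early when a job fails; in that case the command runs until the timeout expires, so the logs printed on failure are the place to look.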