Data Preparation

A processed Flipkart product catalog is used as input data to generate prompts in preparation for fine-tuning. The prompts are generated using Llama 3.1 on Vertex AI. The output is a data set that can be used to fine-tune the base model.

Depending on the infrastructure you provisioned, the data preparation step takes approximately 1 hour and 40 minutes.

Prerequisites

This guide was developed to be run on the playground AI/ML platform. If you are using a different environment the scripts and manifest will need to be modified for that environment.
A bucket containing the processed data from the Data Processing example

NOTE: If you did not execute the data processing example, follow these instructions to load the processed data into the bucket.

Preparation

Accept Llama 3.1 on Vertex AI license agreement terms
```
echo -e "\nhttps://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-3.1-405b-instruct-maas\n"
```
1. Accept the license terms for the Llama 3.1 model
2. On the Llama 3.1 on Vertex AI model card, click the blue ENABLE button

Clone the repository and change directory to the guide directory

git clone https://github.com/GoogleCloudPlatform/accelerated-platforms && \
cd accelerated-platforms/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it

Ensure that your MLP_ENVIRONMENT_FILE is configured
```
cat ${MLP_ENVIRONMENT_FILE} && \
source ${MLP_ENVIRONMENT_FILE}
```
You should see the various variables populated with the information specific to your environment.

Vertex AI OpenAI endpoint variables

Set VERTEX_REGION to Google Cloud region to use for the Vertex AI API OpenAI endpoint calls
```
VERTEX_REGION=us-central1
```
The Llama 3.1 on Vertex API is in preview, it is only available in us-central1

Build the container image

Build the container image using Cloud Build and push the image to Artifact Registry

cd src
sed -i -e "s|^serviceAccount:.*|serviceAccount: projects/${MLP_PROJECT_ID}/serviceAccounts/${MLP_BUILD_GSA}|" cloudbuild.yaml
gcloud beta builds submit \
--config cloudbuild.yaml \
--gcs-source-staging-dir gs://${MLP_CLOUDBUILD_BUCKET}/source \
--project ${MLP_PROJECT_ID} \
--substitutions _DESTINATION=${MLP_DATA_PREPARATION_IMAGE}
cd ..

Run the job

Get credentials for the GKE cluster

gcloud container fleet memberships get-credentials ${MLP_CLUSTER_NAME} --project ${MLP_PROJECT_ID}

Configure the job

Variable	Description	Example
DATASET_INPUT_PATH	The folder path of where the preprocessed flipkart data resides	flipkart_preprocessed_dataset
DATASET_INPUT_FILE	The filename of the preprocessed flipkart data	flipkart.csv
DATASET_OUTPUT_PATH	The folder path of where the generated output data set will reside. This path will be needed for fine-tuning.	dataset/output
PROMPT_MODEL_ID	The Vertex AI model for prompt generation	meta/llama-3.1-70b-instruct-maas

DATASET_INPUT_PATH="flipkart_preprocessed_dataset"
DATASET_INPUT_FILE="flipkart.csv"
DATASET_OUTPUT_PATH="dataset/output"
PROMPT_MODEL_ID="meta/llama-3.1-70b-instruct-maas"

sed \
-i -e "s|V_IMAGE_URL|${MLP_DATA_PREPARATION_IMAGE}|" \
-i -e "s|V_KSA|${MLP_DATA_PREPARATION_KSA}|" \
-i -e "s|V_PROJECT_ID|${MLP_PROJECT_ID}|" \
-i -e "s|V_DATA_BUCKET|${MLP_DATA_BUCKET}|" \
-i -e "s|V_DATASET_INPUT_PATH|${DATASET_INPUT_PATH}|" \
-i -e "s|V_DATASET_INPUT_FILE|${DATASET_INPUT_FILE}|" \
-i -e "s|V_DATASET_OUTPUT_PATH|${DATASET_OUTPUT_PATH}|" \
-i -e "s|V_PROMPT_MODEL_ID|${PROMPT_MODEL_ID}|" \
-i -e "s|V_REGION|${VERTEX_REGION}|" \
manifests/job.yaml

Create the job

kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} apply -f manifests/job.yaml

Once the Job is completed, the prepared datasets are stored in Google Cloud Storage.
```
gcloud storage ls gs://${MLP_DATA_BUCKET}/${DATASET_OUTPUT_PATH}
```

Observability

By default, both GKE and the workloads you run expose metrics and logs in Google Cloud's Observability suite. You can view this information from the Cloud Observability console or the GKE Observability page.

For more information about infrastructure and application metrics, see View observability metrics.

You may want to perform the following tasks specifically for the data preparation use case described in this example.

Monitor the job

In the Google Cloud console, go to the Kubernetes Engine page. Under the Resource Management menu on the left side, click Workloads. From there, you can filter the workloads by cluster name and namespaces. The Observability tab provides system level metric views such as Overview, CPU, and Memory. If you click the job name like data-prep, you can see the job details like the following page:

At the bottom of the page, you can see the status of the managed pods by the job. If your job is having trouble running, the EVENTS and LOGS tabs will provide more insight. You can also adjust the time windows or open the Container logs and Audit logs for additional information.

View the logs

To gain insight into your workload quickly, you can filter and tweak the log queries to view only the relevant logs. You can do so in the Logs Explorer. One fast way to open the Logs Explorer and have the query pre-populated is to click the View in Logs Explorer button on the right side of the LOGS tab once you are on the Job details page.

When the link is opened, you should see something like the following:

The Logs Explorer provides many nice features besides tweaking your log query in the Query field. For example, if you want to know which steps the job has completed, you can run the following query based on the source code:

resource.type="k8s_container"
resource.labels.location="us-central1"
resource.labels.namespace_name="ml-team"
jsonPayload.message = (
"***Job Start***" OR
"Configure signal handlers" OR
"Prepare context for model prompt" OR
"Generate Q & A according" OR
"Generate Prompts for Gemma IT model" OR
"Upload prepared dataset into GCS" OR
"***Job End***")

As another example, if you want to know how many prompts are generated in a specific time window, you can do something like the following:

Look for the log entries from the code associated with the prompt generation. In this example, the Content generated log entry is produced each time a prompt is generated.
You can click the Similar entries, which automatically updates the log query for you and lists all Content generated entries.
Adjust the timeline in the middle of the page and zoom in/out. You will see how many log entries are ingested during a specific time window, such as 30 seconds. That number should be the same as the number of prompts generated by the code.

Log Analytics

You can also use Log Analytics to analyze your logs. After it is enabled, you can run SQL queries to gain insight from the logs. The result can also be charted. For example, you can click the Analyze results link on the Logs Explorer page and open the Log Analytics page with a converted SQL query. The chart and table you view can also be added to a dashboard.

Notes

The raw pre-crawled public dataset, license.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Data Preparation

Prerequisites

Preparation

Vertex AI OpenAI endpoint variables

Build the container image

Run the job

Observability

Monitor the job

View the logs

Log Analytics

Notes

Files

README.md

Latest commit

History

README.md

File metadata and controls

Data Preparation

Prerequisites

Preparation

Vertex AI OpenAI endpoint variables

Build the container image

Run the job

Observability

Monitor the job

View the logs

Log Analytics

Notes