`tutorials-and-examples/inference-servers/checkpoints/README.md`
Now you can use it in a [Kubernetes job](../jetstream/maxtext/single-host-infere
## Jetstream + MaxText

```
--bucket_name: [string] The GSBucket name to store checkpoints, without gs://.
--inference_server: [string] The name of the inference server that serves your model. (Optional) (default=jetstream-maxtext)
--model_path: [string] The model path.
--model_name: [string] The model name. ex. llama-2, llama-3, gemma.
--huggingface: [bool] The model is from Hugging Face. (Optional) (default=False)
--quantize_type: [string] The type of quantization. (Optional)
--quantize_weights: [bool] The checkpoint is to be quantized. (Optional) (default=False)
--input_directory: [string] The input directory, likely a GSBucket path.
--output_directory: [string] The output directory, likely a GSBucket path.
--meta_url: [string] The url from Meta. (Optional)
--version: [string] The version of the repository. (Optional) (default=main)
```
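For illustration, an invocation with these flags might look like the sketch below; the entrypoint script name (`checkpoint_converter.sh`) and all of the values are assumptions for this sketch, so check the actual job manifest for the real entrypoint and flag syntax.

```
# Illustrative sketch only: script name, bucket, and model values are assumed.
./checkpoint_converter.sh \
  --bucket_name=my-ckpt-bucket \
  --inference_server=jetstream-maxtext \
  --model_path=meta-llama/Llama-2-7b \
  --model_name=llama-2 \
  --huggingface=True \
  --output_directory=gs://my-ckpt-bucket/maxtext/llama-2-7b
```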
## Jetstream + PyTorch/XLA

```
--inference_server: [string] The name of the inference server that serves your model.
--model_path: [string] The model path.
--model_name: [string] The model name. ex. llama-2, llama-3, gemma.
--quantize_weights: [bool] The checkpoint is to be quantized. (Optional) (default=False)
--quantize_type: [string] The type of quantization. Available quantize types: {"int8", "int4"} x {"per_channel", "blockwise"}. (Optional) (default=int8_per_channel)
--version: [string] The version of the repository to override, ex. jetstream-v0.2.2, jetstream-v0.2.3. (Optional) (default=main)
--input_directory: [string] The input directory, likely a GSBucket path. (Optional)
--output_directory: [string] The output directory, likely a GSBucket path.
--huggingface: [bool] The model is from Hugging Face. (Optional) (default=False)
```
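As a similar sketch, requesting an int4 blockwise quantized checkpoint with this flag set could look like the following; again the script name, server name, and paths are assumptions:

```
# Illustrative sketch only: script name, server name, and paths are assumed.
./checkpoint_converter.sh \
  --inference_server=jetstream-pytorch \
  --model_path=gs://my-ckpt-bucket/models/llama-2-7b \
  --model_name=llama-2 \
  --quantize_weights=True \
  --quantize_type=int4_blockwise \
  --output_directory=gs://my-ckpt-bucket/pytorch/llama-2-7b
```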
## Convert a Llama3.1-405B Meta checkpoint for MaxText

This example will walk through converting a Llama3.1-405b from Meta to a MaxText compatible checkpoint for inference.

The Llama3.1-405B model needs ~2000 GB of client memory to download and run checkpoint conversion. The `m1-ultramem-160` machine type and 3000 GB boot disk used for the node pool below provide enough capacity for the model download and conversion. Checkpoint conversion for the 405B model supports weights downloaded only from Meta.

## Agree to the Meta Terms and Conditions

Go to https://www.llama.com/llama-downloads/ to acknowledge the terms and conditions. Select `Llama3.1: 405B & 8B` as the model(s) you will request.

Copy the provided Meta URL to use in your manifest file.
## Create GCS Bucket to store checkpoint

```
BUCKET_NAME=<your bucket>

gcloud storage buckets create gs://$BUCKET_NAME
```
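If you want to confirm the bucket before moving on, a plain listing works (standard `gcloud storage` command, nothing tutorial-specific):

```
# Confirm the bucket exists and is accessible
gcloud storage ls gs://$BUCKET_NAME
```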
## Configure a service account for Storage Object access

**This step can be skipped if already done on the cluster with a different Service Account.**

Configure a Kubernetes service account to act as an IAM service account.

Create an IAM service account for your application:

```
gcloud iam service-accounts create checkpoint-converter
```

Add an IAM policy binding for your IAM service account to manage Cloud Storage:
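The binding command itself falls outside this excerpt; a sketch of its usual form follows, assuming the service account created above and the `roles/storage.objectAdmin` role (verify which role your setup actually grants, and substitute your project ID):

```
# Sketch: the role and PROJECT_ID are assumptions, not shown in this excerpt
gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
  --member="serviceAccount:checkpoint-converter@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
```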
Create a node pool with machine type `m1-ultramem-160`. You may need to request m1 quota in your project.

```
CLUSTER=<your cluster>
ZONE=<your zone>
PROJECT=<project>

gcloud container node-pools create m1-pool \
  --cluster "${CLUSTER}" \
  --zone "${ZONE}" \
  --machine-type m1-ultramem-160 \
  --num-nodes 1 \
  --disk-size 3000 \
  --project "${PROJECT}" \
  --scopes cloud-platform
```
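Once the pool is created, you can confirm the node registered and is Ready; `cloud.google.com/gke-nodepool` is the standard GKE node label:

```
# The node should report Ready before you submit the conversion job
kubectl get nodes -l cloud.google.com/gke-nodepool=m1-pool
```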
In `checkpoint-converter.yaml`, replace `BUCKET_NAME` with the GCS Bucket that you created earlier, without gs://.
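One way to make that substitution in place, assuming GNU sed and a literal `BUCKET_NAME` placeholder in the manifest:

```
# Replace the placeholder with your bucket name (GNU sed; on macOS use: sed -i '')
sed -i "s/BUCKET_NAME/${BUCKET_NAME}/g" checkpoint-converter.yaml
```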
Parameter descriptions:
```
--bucket_name: [string] The GSBucket name to store checkpoints, without gs://.
--inference_server: [string] The name of the inference server that serves your model. (ex. jetstream-maxtext)
--meta_url: [string] The url from Meta.
--model_name: [string] The model name. (ex. llama-2, llama-3, llama-3.1)
--model_path: [string] The model path. For Llama models, download llama via `pip install llama-stack` and run `llama model list --show-all` to find the Model Descriptor to use. (ex. Llama3.1-405B-Instruct:bf16-mp16)
--output_directory: [string] The output directory. (ex. gs://bucket_name/maxtext/llama3.1-405b)
--quantize_type: [string] The type of quantization. (ex. int8)
--quantize_weights: [bool] The checkpoint is to be quantized. (ex. True)
```
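Put together, a flag set for an int8 405B conversion might look like this; the bucket name is a placeholder and the Meta URL is the one from your email:

```
--bucket_name=my-ckpt-bucket
--inference_server=jetstream-maxtext
--meta_url=<the signed URL from Meta>
--model_name=llama-3.1
--model_path=Llama3.1-405B-Instruct:bf16-mp16
--output_directory=gs://my-ckpt-bucket/maxtext/llama3.1-405b
--quantize_type=int8
--quantize_weights=True
```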
For a bf16 checkpoint only, remove the flags `--quantize_type` and `--quantize_weights`.

Apply the manifest:

```
kubectl apply -f checkpoint-converter.yaml
```
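You can also stream the job's logs directly from the terminal; the Job name `checkpoint-converter` is an assumption based on the pod-name prefix used in the log query below:

```
# Stream converter logs (Job name assumed from the pod-name prefix)
kubectl logs -f job/checkpoint-converter
```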
The checkpoint conversion job takes around 9-10 hours to complete. To monitor the progress of the checkpoint download and conversion, check [GCP Log Explorer](https://console.cloud.google.com/logs/query) and enter the following query:

```
resource.type="k8s_container"
resource.labels.project_id="PROJECT_ID"
resource.labels.location="LOCATION"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.namespace_name="default"
resource.labels.pod_name:"checkpoint-converter-"
```
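The same filter works from the CLI via the standard `gcloud logging read` command; substitute your project ID:

```
gcloud logging read '
  resource.type="k8s_container"
  resource.labels.namespace_name="default"
  resource.labels.pod_name:"checkpoint-converter-"' \
  --project=PROJECT_ID --limit=50
```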
Once completed, you will see a log similar to:

```
# bf16 checkpoint
Completed unscanning checkpoint to gs://output_directory/bf16/unscanned/checkpoints/0/items

# int8 checkpoint
Completed quantizing model llama3.1-405b with int8 to gs://output_directory/int8
```
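To confirm the converted artifacts landed in the bucket, you can list the output directory; the path below assumes the example `--output_directory` from the parameter descriptions:

```
# List the converted checkpoint artifacts
gcloud storage ls gs://$BUCKET_NAME/maxtext/llama3.1-405b/
```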