Commit 34e2131

Add quantization support for maxtext models and update conversion scripts (#1006)
* Support quantization for llama3.3-70b and llama3.1-405b on CPUs
* Add quantization support for maxtext models and update conversion scripts
1 parent 1d7e20d commit 34e2131

File tree

3 files changed (+249, -113 lines)

tutorials-and-examples/inference-servers/checkpoints/Dockerfile (+21, -1)

```diff
@@ -1,3 +1,17 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 # Ubuntu:22.04
 # Use Ubuntu 22.04 from Docker Hub.
 # https://hub.docker.com/_/ubuntu/tags?page=1&name=22.04
@@ -22,10 +36,16 @@ RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyri
 RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
 RUN apt -y update && apt install -y google-cloud-cli
 
+RUN git clone https://github.com/AI-Hypercomputer/maxtext.git && \
+    cd /maxtext && \
+    bash setup.sh
+
 RUN pip install kaggle && \
     pip install huggingface_hub[cli] && \
     pip install google-jetstream && \
-    pip install llama-toolchain
+    pip install llama-stack && \
+    pip install torch && \
+    pip install grain-nightly==0.0.10
 
 COPY checkpoint_converter.sh /usr/bin/
 RUN chmod +x /usr/bin/checkpoint_converter.sh
```
tutorials-and-examples/inference-servers/checkpoints/README.md (+22, -29)

````diff
@@ -5,43 +5,36 @@ The `checkpoint_entrypoint.sh` script overviews how to convert your inference ch
 Build the checkpoint conversion Dockerfile
 ```
 docker build -t inference-checkpoint .
-docker tag inference-checkpoint gcr.io/${PROJECT_ID}/inference-checkpoint:latest
-docker push gcr.io/${PROJECT_ID}/inference-checkpoint:latest
+docker tag inference-checkpoint ${LOCATION}-docker.pkg.dev/${PROJECT_ID}/jetstream/inference-checkpoint:latest
+docker push ${LOCATION}-docker.pkg.dev/${PROJECT_ID}/jetstream/inference-checkpoint:latest
 ```
 
 Now you can use it in a [Kubernetes job](../jetstream/maxtext/single-host-inference/checkpoint-job.yaml) and pass the following arguments
 
 ## Jetstream + MaxText
 ```
-- -s=INFERENCE_SERVER
-- -b=BUCKET_NAME
-- -m=MODEL_PATH
-- -v=VERSION (Optional)
+-b, --bucket_name: [string] The GSBucket name to store checkpoints, without gs://.
+-s, --inference_server: [string] The name of the inference server that serves your model. (Optional) (default=jetstream-maxtext)
+-m, --model_path: [string] The model path.
+-n, --model_name: [string] The model name.
+-h, --huggingface: [bool] The model is from Hugging Face. (Optional) (default=False)
+-t, --quantize_type: [string] The type of quantization. (Optional)
+-q, --quantize_weights: [bool] Whether to quantize the checkpoint. (Optional) (default=False)
+-i, --input_directory: [string] The input directory, likely a GSBucket path.
+-o, --output_directory: [string] The output directory, likely a GSBucket path.
+-u, --meta_url: [string] The URL from Meta. (Optional)
+-v, --version: [string] The version of the repository. (Optional) (default=main)
 ```
 
 ## Jetstream + Pytorch/XLA
 ```
-- -s=INFERENCE_SERVER
-- -m=MODEL_PATH
-- -n=MODEL_NAME
-- -q=QUANTIZE_WEIGHTS (Optional) (default=False)
-- -t=QUANTIZE_TYPE (Optional) (default=int8_per_channel)
-- -v=VERSION (Optional) (default=jetstream-v0.2.3)
-- -i=INPUT_DIRECTORY (Optional)
-- -o=OUTPUT_DIRECTORY
-- -h=HUGGINGFACE (Optional) (default=False)
-```
-
-## Argument descriptions:
-```
-b) BUCKET_NAME: (str) GSBucket, without gs://
-s) INFERENCE_SERVER: (str) Inference server, ex. jetstream-maxtext, jetstream-pytorch
-m) MODEL_PATH: (str) Model path, varies depending on inference server and location of base checkpoint
-n) MODEL_NAME: (str) Model name, ex. llama-2, llama-3, gemma
-h) HUGGINGFACE: (bool) Checkpoint is from HuggingFace.
-q) QUANTIZE_WEIGHTS: (str) Whether to quantize weights
-t) QUANTIZE_TYPE: (str) Quantization type, QUANTIZE_WEIGHTS must be set to true. Availabe quantize type: {"int8", "int4"} x {"per_channel", "blockwise"},
-v) VERSION: (str) Version of inference server to override, ex. jetstream-v0.2.2, jetstream-v0.2.3
-i) INPUT_DIRECTORY: (str) Input checkpoint directory, likely a GSBucket path
-o) OUTPUT_DIRECTORY: (str) Output checkpoint directory, likely a GSBucket path
+- -s, --inference_server: [string] The name of the inference server that serves your model.
+- -m, --model_path: [string] The model path.
+- -n, --model_name: [string] The model name, ex. llama-2, llama-3, gemma.
+- -q, --quantize_weights: [bool] Whether to quantize the checkpoint. (Optional) (default=False)
+- -t, --quantize_type: [string] The type of quantization. Available quantize types: {"int8", "int4"} x {"per_channel", "blockwise"}. (Optional) (default=int8_per_channel)
+- -v, --version: [string] The version of the repository to override, ex. jetstream-v0.2.2, jetstream-v0.2.3. (Optional) (default=main)
+- -i, --input_directory: [string] The input directory, likely a GSBucket path. (Optional)
+- -o, --output_directory: [string] The output directory, likely a GSBucket path.
+- -h, --huggingface: [bool] The model is from Hugging Face. (Optional) (default=False)
 ```
````