
Commit 6c8499f

Updated READMEs (#130)
1 parent 0e21542 commit 6c8499f


2 files changed: +65 -67 lines

@@ -1,54 +1,51 @@
 # Inference Cost Optimization

-Running inference for large language models can be expensive. Often, this cost
-goes even higher based on your requirements. For example, if your requirement is
-to reduce the latency in starting up inference, it may need advanced
-accelerators on high end virtual machines with expansive storage options. Often,
-there is confusion in choosing the right accelerator , virtual machine and
-storage options when if comes to running the large language models. The goal of
-this guide is to provide the cost efficient and high performant ways to run
-inference for Llama models.
+Running inference for large language models (LLMs) can be expensive and costs
+can increase due to specific requirements. For example, reducing inference
+startup latency may require advanced accelerators on high-end virtual machines
+with extensive storage options. Choosing the right combination of accelerator,
+virtual machine, and storage options for running large language models can be
+complicated. The goal of this guide is to provide cost-efficient and
+high-performance methods for running Llama model inference.

 ## Choosing accelerator, machine and storage

-Google Cloud provides different types of accelerators, GPUs(L4, A100, H100) and
-TPUs and storage options(GCS, Parallelstore, Hyperdisk ML, Persistent Disk) that
-covers end to end requirements for running large language models in cost
-efficient fashion.
-
-Typically, the type and the number of accelerators that will be used in your
-inference is decided based on the size of your model. For example, if you want
-to run llama 70B model which has weights of about 132GB, you will need at least
-8 nvidia-l4 or 4 nvidia-tesla-a100 or 2 nvidia-a100-80gb to run inference.
-However, you can use additional accelerators to achieve faster model load time
-and serving e.g use 8 nvidia-tesla-a100 instead of 4.
-
-Different VM types provide different number of GPUs, GPU memory, vCPU, VM memory
-and network bandwidth. You can choose the VM once you have decided what and how
-many accelerators you want to use to run inference. For example, if you decide
-to use 8 nvidia-l4 GPUs to serve llama 70B model, you can use g2-standard-96
-which is the only machine type that provides 8 nvidia-l4 in G2 machine series.
-You can configuration of different machine types at [Google Cloud GPU Machine
-types documentation][gpus]
-
-The storage option is a key factor in determining the cost and performance of
-your inference along with the accelerator. This is because the model is loaded
-from storage into GPU memory to run inference. The more throughput a storage
-provides, the faster will be the model load time and shorter will be the
-inference start up time and Time To First Token. On Google Cloud, the storage
-can be zonal, regional or multi-regional. This means if you use a zonal storage
-and decide to run inference workload in 3 different zones, you will need three
-instance of the storage in each of those zones. Therefore, it becomes critical
-to choose the storage option wisely to optimize the cost.
+Google Cloud offers a variety of accelerators, including
+[Graphics Processing Units (GPUs)](https://cloud.google.com/gpu) and
+[Tensor Processing Units (TPUs)](https://cloud.google.com/tpu) model, as well as
+[storage options](https://cloud.google.com/products/storage) such as Google
+Cloud Storage (GCS), Parallelstore, Hyperdisk ML, and Persistent Disk. These
+comprehensive options enable cost-effective operation of large language models
+for various requirements.
+
+The number and type of accelerators needed for inference is typically determined
+by the size of your model. For instance, running a
+[Llama](https://www.llama.com/) 70B model, which has weights of roughly 132GB,
+requires a minimum of eight NVIDIA L4 GPUs, four NVIDIA A100 40GB GPUs, or two
+NVIDIA A100 80GB GPUs. However, using additional accelerators, such as using
+eight NVIDIA A100 40GB instead of four, can result in faster model loading times
+and improved inference.
+
+The number of GPUs, amount of GPU memory, vCPU, memory, and network bandwidth
+are all factors that vary across different virtual machine types. After deciding
+on the type and quantity of accelerators required for inference, you can select
+the appropriate VM family and type. For example, if you need eight NVIDIA L4
+GPUs to serve the Llama 70B model, in the G2 instances the `g2-standard-96`
+machine type is the only one that can accommodate this requirement. For an
+overview of the different GPU VMs that are available on Compute Engine, see the
+[GPU Machine types](https://cloud.google.com/compute/docs/gpus) documentation.
+
+The storage and accelerator that you choose are key factors that affect the cost
+and performance of your inference. To run inference, the model must be loaded
+from storage into GPU memory. Thus, storage throughput affects model load time,
+inference start-up time, and time to first token (TTFT). Google Cloud storage
+can be zonal, regional, or multi-regional. If you use zonal storage and run
+inference workloads in three different zones, you will need three instances of
+storage, one in each of the zones. Choosing the right storage option is critical
+for cost optimization.

 ## Storage optimization

-In the [GCS storage optimization][gcs-storage-optimization] guide, we
-demonstrate how you can fine tune GCS to achieve the best performance with lower
-cost. T
-
----
-
-[gpus]: https://cloud.google.com/compute/docs/gpus
-[gcs-storage-optimization]:
-/use-cases/inferencing/cost-optimization/gcsfuse/README.md
+In the
+[GCS storage optimization](/use-cases/inferencing/cost-optimization/gcsfuse/README.md)
+guide, we demonstrate how you can tune GCS to achieve the best cost performance.
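The sizing guidance in the new text above lends itself to a quick back-of-envelope check. The following sketch is illustrative only and is not part of the commit: the weight size and per-GPU memory figures are assumptions drawn from the prose (roughly 132 GB of Llama 70B weights, 24 GB on an NVIDIA L4), and the `gcloud` commands assume the Google Cloud CLI is installed.

```sh
# Rough floor on GPU count: ceil(weights / per-GPU memory). Real deployments
# need headroom for the KV cache and activations, which is why the README
# recommends eight L4s rather than six for Llama 70B.
MODEL_WEIGHTS_GB=132   # approximate Llama 70B weight size from the README
GPU_MEMORY_GB=24       # NVIDIA L4 memory
echo "Minimum GPUs just to hold the weights: $(( (MODEL_WEIGHTS_GB + GPU_MEMORY_GB - 1) / GPU_MEMORY_GB ))"

# Inspect the machine type and accelerator called out in the README.
gcloud compute machine-types describe g2-standard-96 --zone=us-central1-a
gcloud compute accelerator-types list --filter="name=nvidia-l4"
```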

use-cases/inferencing/cost-optimization/gcsfuse/README.md (+22 -21)
@@ -1,17 +1,18 @@
-# Use GCS to store model and GCSFuse to download
+# GCS storage optimization

-In this guide, you will run inference of llama 70B model twice. In the first
-run, the model will be stored in a flat GCS bucket and you will use GCSFuse
-without any fine tuning to download the model from the bucket and start
-inference. In the second run, the model will be stored in a hierarchical GCS
-bucket and you will fine-tune GCSFuse configurations to download the model from
-the bucket and start inference.
+This guide will demonstrate two different methods to deploy the Llama 70B model
+for inference using
+[Cloud Storage FUSE](https://cloud.google.com/storage/docs/cloud-storage-fuse/overview)
+via the Google Kubernetes Engine (GKE) Cloud Storage FUSE CSI driver. The first
+method uses a standard, "flat" GCS bucket to store the model. The second method
+utilizes a
+[hierarchical namespace](https://cloud.google.com/storage/docs/hns-overview) GCS
+bucket and tuned configuration.

 > Note : By default, a GCS bucket is created as flat.

-The goal of this guide is to demonstrate performance improvement in the model
-load time and pod startup time when using fine-tuned configurations with
-GCSFuse.
+This guide aims to show how to improve model load time and pod startup time by
+using tuned configurations with Cloud Storage FUSE.

 ## Prerequisites
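The flat versus hierarchical namespace distinction introduced above can be seen at bucket creation time. As a minimal sketch (bucket names and location are placeholders, and the flags reflect the current `gcloud storage` CLI as an assumption rather than anything in this commit):

```sh
# Default bucket: flat namespace.
gcloud storage buckets create gs://my-model-bucket --location=us-central1

# Hierarchical namespace bucket: must be enabled at creation time and
# requires uniform bucket-level access.
gcloud storage buckets create gs://my-model-bucket-hns \
  --location=us-central1 \
  --uniform-bucket-level-access \
  --enable-hierarchical-namespace
```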

@@ -41,10 +42,10 @@ GCSFuse.
 - Ensure that your `MLP_ENVIRONMENT_FILE` is configured.

 ```sh
-set -a
+set -o allexport
 cat ${MLP_ENVIRONMENT_FILE} && \
 source ${MLP_ENVIRONMENT_FILE}
-set +a
+set +o allexport
 ```

 > You should see the various variables populated with the information specific
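The only change in this hunk is cosmetic: `set -o allexport` is the long, self-describing form of `set -a`, so every variable defined while it is in effect is exported automatically. A standalone illustration:

```sh
set -o allexport          # same effect as `set -a`
MY_VAR="visible to child processes"
set +o allexport          # same effect as `set +a`
sh -c 'echo "$MY_VAR"'    # prints: visible to child processes
```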
@@ -126,7 +127,7 @@ GCSFuse.

 ```
 NAME READY UP-TO-DATE AVAILABLE AGE
-vllm-openai-gcs-llama33-70b-a100 1/1 1 1 XXXXX
+vllm-openai-gcs-llama33-70b-a100 1/1 1 1 XXXXX
 ```

 ## Calculate pod startup time
@@ -140,7 +141,7 @@ GCSFuse.

 ENDING_MODEL_DOWNLOAD_TIME=`kubectl --namespace ${MLP_MODEL_SERVE_NAMESPACE} logs ${POD_NAME} -c "inference-server" | grep "^INFO.*Loading model weights took" | head -n 1 | awk '{print $2" "$3}' | xargs -I {} date -d "$(date +%Y)-{}" +%s%3N`

-MODEL_LOAD_TIME_WITHOUT_TUNING=$(( (ENDING_MODEL_DOWNLOAD_TIME - BEGIN_MODEL_DOWNLOAD_TIME)/1000 ))
+MODEL_LOAD_TIME_WITHOUT_TUNING=$(((ENDING_MODEL_DOWNLOAD_TIME - BEGIN_MODEL_DOWNLOAD_TIME)/1000 ))

 echo "MODEL LOAD TIME WITHOUT TUNING - ${MODEL_LOAD_TIME_WITHOUT_TUNING}s"
 ```
@@ -156,7 +157,7 @@ GCSFuse.

 POD_READY_TIME=`kubectl --namespace "${MLP_MODEL_SERVE_NAMESPACE}" get pods "$POD_NAME" -o json | jq -r '.status.conditions[] | select(.type == "Ready") | .lastTransitionTime' | date -f - +%s%3N`

-POD_STARTUP_TIME_WITHOUT_TUNING=$(( (POD_READY_TIME - POD_SCHEDULED_TIME)/1000 ))
+POD_STARTUP_TIME_WITHOUT_TUNING=$(((POD_READY_TIME - POD_SCHEDULED_TIME)/1000 ))

 echo "POD STARTUP TIME WITHOUT TUNING - ${POD_STARTUP_TIME_WITHOUT_TUNING}s"
 ```
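The long pipelines in these hunks all follow the same pattern; only the spacing inside the arithmetic expansion changes. Re-stated with comments for readability (same commands as in the guide, with the pod and namespace variables assumed to be set):

```sh
# 1. Read the vLLM server logs and keep the first "Loading model weights took" line.
# 2. awk prints the month-day and time columns of that log line.
# 3. date converts "<year>-<month-day> <time>" into epoch milliseconds (+%s%3N),
#    matching the format used for the other timestamps in the guide.
kubectl --namespace "${MLP_MODEL_SERVE_NAMESPACE}" logs "${POD_NAME}" -c inference-server \
  | grep "^INFO.*Loading model weights took" \
  | head -n 1 \
  | awk '{print $2" "$3}' \
  | xargs -I {} date -d "$(date +%Y)-{}" +%s%3N
```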
@@ -225,7 +226,7 @@ GCSFuse.

 ENDING_MODEL_DOWNLOAD_TIME_TUNED=`kubectl --namespace ${MLP_MODEL_SERVE_NAMESPACE} logs ${POD_NAME_TUNED} -c "inference-server" | grep "^INFO.*Loading model weights took" | head -n 1 | awk '{print $2" "$3}' | xargs -I {} date -d "$(date +%Y)-{}" +%s%3N`

-MODEL_LOAD_TIME_WITH_TUNING=$(( (ENDING_MODEL_DOWNLOAD_TIME_TUNED - BEGIN_MODEL_DOWNLOAD_TIME_TUNED)/1000 ))
+MODEL_LOAD_TIME_WITH_TUNING=$(((ENDING_MODEL_DOWNLOAD_TIME_TUNED - BEGIN_MODEL_DOWNLOAD_TIME_TUNED)/1000 ))

 echo "MODEL LOAD TIME WITH TUNING - ${MODEL_LOAD_TIME_WITH_TUNING}s"
 ```
@@ -241,7 +242,7 @@ GCSFuse.

 POD_READY_TIME_TUNED=`kubectl --namespace "${MLP_MODEL_SERVE_NAMESPACE}" get pods "$POD_NAME_TUNED" -o json | jq -r '.status.conditions[] | select(.type == "Ready") | .lastTransitionTime' | date -f - +%s%3N`

-POD_STARTUP_TIME_WITH_TUNING=$(( (POD_READY_TIME_TUNED - POD_SCHEDULED_TIME_TUNED)/1000 ))
+POD_STARTUP_TIME_WITH_TUNING=$(((POD_READY_TIME_TUNED - POD_SCHEDULED_TIME_TUNED)/1000 ))

 echo "POD STARTUP TIME WITH TUNING - ${POD_STARTUP_TIME_WITH_TUNING}s"
 ```
@@ -257,7 +258,7 @@ echo $POD_STARTUP_TIME_WITHOUT_TUNING
 echo $POD_STARTUP_TIME_WITH_TUNING
 ```

-The pod startup time without fine-tuning will be around 20 minutes and with
-fine-tuning, it will be around 3 mins. GCSFuse can facilitate faster download of
-the model weights and reduces the time to startup inference server. You will see
-good improvements with large models which have weights over several GBs.
+The pod startup time is approximately 20 minutes, but with tuning it can be
+reduced to around 3 minutes. GCSFuse can further decrease startup time by
+enabling faster downloads of model weights, which is especially beneficial for
+large models with weights exceeding several GBs.
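Once both runs have completed, the variables captured earlier in the guide can be compared directly. A small example using only names already defined above (both values are in seconds):

```sh
echo "Pod startup without tuning: ${POD_STARTUP_TIME_WITHOUT_TUNING}s"
echo "Pod startup with tuning:    ${POD_STARTUP_TIME_WITH_TUNING}s"
# Integer ratio; with ~20 minutes vs ~3 minutes this is roughly 6-7x.
echo "Speedup: $(( POD_STARTUP_TIME_WITHOUT_TUNING / POD_STARTUP_TIME_WITH_TUNING ))x"
```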
