
Commit 6c8499f

Updated READMEs (#130)
1 parent 0e21542 commit 6c8499f


2 files changed: +65 -67 lines

@@ -1,54 +1,51 @@
 # Inference Cost Optimization

-Running inference for large language models can be expensive. Often, this cost
-goes even higher based on your requirements. For example, if your requirement is
-to reduce the latency in starting up inference, it may need advanced
-accelerators on high end virtual machines with expansive storage options. Often,
-there is confusion in choosing the right accelerator , virtual machine and
-storage options when if comes to running the large language models. The goal of
-this guide is to provide the cost efficient and high performant ways to run
-inference for Llama models.
+Running inference for large language models (LLMs) can be expensive and costs
+can increase due to specific requirements. For example, reducing inference
+startup latency may require advanced accelerators on high-end virtual machines
+with extensive storage options. Choosing the right combination of accelerator,
+virtual machine, and storage options for running large language models can be
+complicated. The goal of this guide is to provide cost-efficient and
+high-performance methods for running Llama model inference.

 ## Choosing accelerator, machine and storage

-Google Cloud provides different types of accelerators, GPUs(L4, A100, H100) and
-TPUs and storage options(GCS, Parallelstore, Hyperdisk ML, Persistent Disk) that
-covers end to end requirements for running large language models in cost
-efficient fashion.
-
-Typically, the type and the number of accelerators that will be used in your
-inference is decided based on the size of your model. For example, if you want
-to run llama 70B model which has weights of about 132GB, you will need at least
-8 nvidia-l4 or 4 nvidia-tesla-a100 or 2 nvidia-a100-80gb to run inference.
-However, you can use additional accelerators to achieve faster model load time
-and serving e.g use 8 nvidia-tesla-a100 instead of 4.
-
-Different VM types provide different number of GPUs, GPU memory, vCPU, VM memory
-and network bandwidth. You can choose the VM once you have decided what and how
-many accelerators you want to use to run inference. For example, if you decide
-to use 8 nvidia-l4 GPUs to serve llama 70B model, you can use g2-standard-96
-which is the only machine type that provides 8 nvidia-l4 in G2 machine series.
-You can configuration of different machine types at [Google Cloud GPU Machine
-types documentation][gpus]
-
-The storage option is a key factor in determining the cost and performance of
-your inference along with the accelerator. This is because the model is loaded
-from storage into GPU memory to run inference. The more throughput a storage
-provides, the faster will be the model load time and shorter will be the
-inference start up time and Time To First Token. On Google Cloud, the storage
-can be zonal, regional or multi-regional. This means if you use a zonal storage
-and decide to run inference workload in 3 different zones, you will need three
-instance of the storage in each of those zones. Therefore, it becomes critical
-to choose the storage option wisely to optimize the cost.
+Google Cloud offers a variety of accelerators, including
+[Graphics Processing Units (GPUs)](https://cloud.google.com/gpu) and
+[Tensor Processing Units (TPUs)](https://cloud.google.com/tpu) model, as well as
+[storage options](https://cloud.google.com/products/storage) such as Google
+Cloud Storage (GCS), Parallelstore, Hyperdisk ML, and Persistent Disk. These
+comprehensive options enable cost-effective operation of large language models
+for various requirements.
+
+The number and type of accelerators needed for inference is typically determined
+by the size of your model. For instance, running a
+[Llama](https://www.llama.com/) 70B model, which has weights of roughly 132GB,
+requires a minimum of eight NVIDIA L4 GPUs, four NVIDIA A100 40GB GPUs, or two
+NVIDIA A100 80GB GPUs. However, using additional accelerators, such as using
+eight NVIDIA A100 40GB instead of four, can result in faster model loading times
+and improved inference.
+
+The number of GPUs, amount of GPU memory, vCPU, memory, and network bandwidth
+are all factors that vary across different virtual machine types. After deciding
+on the type and quantity of accelerators required for inference, you can select
+the appropriate VM family and type. For example, if you need eight NVIDIA L4
+GPUs to serve the Llama 70B model, in the G2 instances the `g2-standard-96`
+machine type is the only one that can accommodate this requirement. For an
+overview of the different GPU VMs that are available on Compute Engine, see the
+[GPU Machine types](https://cloud.google.com/compute/docs/gpus) documentation.
+
+The storage and accelerator that you choose are key factors that affect the cost
+and performance of your inference. To run inference, the model must be loaded
+from storage into GPU memory. Thus, storage throughput affects model load time,
+inference start-up time, and time to first token (TTFT). Google Cloud storage
+can be zonal, regional, or multi-regional. If you use zonal storage and run
+inference workloads in three different zones, you will need three instances of
+storage, one in each of the zones. Choosing the right storage option is critical
+for cost optimization.

 ## Storage optimization

-In the [GCS storage optimization][gcs-storage-optimization] guide, we
-demonstrate how you can fine tune GCS to achieve the best performance with lower
-cost. T
-
----
-
-[gpus]: https://cloud.google.com/compute/docs/gpus
-[gcs-storage-optimization]:
-/use-cases/inferencing/cost-optimization/gcsfuse/README.md
+In the
+[GCS storage optimization](/use-cases/inferencing/cost-optimization/gcsfuse/README.md)
+guide, we demonstrate how you can tune GCS to achieve the best cost performance.
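The sizing guidance in the new text above lends itself to a quick back-of-envelope check. The following sketch is illustrative only and is not part of the commit: the weight size and per-GPU memory figures are assumptions drawn from the prose (roughly 132 GB of Llama 70B weights, 24 GB on an NVIDIA L4), and the `gcloud` commands assume the Google Cloud CLI is installed.

```sh
# Rough floor on GPU count: ceil(weights / per-GPU memory). Real deployments
# need headroom for the KV cache and activations, which is why the README
# recommends eight L4s rather than six for Llama 70B.
MODEL_WEIGHTS_GB=132   # approximate Llama 70B weight size from the README
GPU_MEMORY_GB=24       # NVIDIA L4 memory
echo "Minimum GPUs just to hold the weights: $(( (MODEL_WEIGHTS_GB + GPU_MEMORY_GB - 1) / GPU_MEMORY_GB ))"

# Inspect the machine type and accelerator called out in the README.
gcloud compute machine-types describe g2-standard-96 --zone=us-central1-a
gcloud compute accelerator-types list --filter="name=nvidia-l4"
```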

use-cases/inferencing/cost-optimization/gcsfuse/README.md (+22 -21)
@@ -1,17 +1,18 @@
-# Use GCS to store model and GCSFuse to download
+# GCS storage optimization

-In this guide, you will run inference of llama 70B model twice. In the first
-run, the model will be stored in a flat GCS bucket and you will use GCSFuse
-without any fine tuning to download the model from the bucket and start
-inference. In the second run, the model will be stored in a hierarchical GCS
-bucket and you will fine-tune GCSFuse configurations to download the model from
-the bucket and start inference.
+This guide will demonstrate two different methods to deploy the Llama 70B model
+for inference using
+[Cloud Storage FUSE](https://cloud.google.com/storage/docs/cloud-storage-fuse/overview)
+via the Google Kubernetes Engine (GKE) Cloud Storage FUSE CSI driver. The first
+method uses a standard, "flat" GCS bucket to store the model. The second method
+utilizes a
+[hierarchical namespace](https://cloud.google.com/storage/docs/hns-overview) GCS
+bucket and tuned configuration.

 > Note : By default, a GCS bucket is created as flat.

-The goal of this guide is to demonstrate performance improvement in the model
-load time and pod startup time when using fine-tuned configurations with
-GCSFuse.
+This guide aims to show how to improve model load time and pod startup time by
+using tuned configurations with Cloud Storage FUSE.

 ## Prerequisites
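The flat versus hierarchical namespace distinction introduced above can be seen at bucket creation time. As a minimal sketch (bucket names and location are placeholders, and the flags reflect the current `gcloud storage` CLI as an assumption rather than anything in this commit):

```sh
# Default bucket: flat namespace.
gcloud storage buckets create gs://my-model-bucket --location=us-central1

# Hierarchical namespace bucket: must be enabled at creation time and
# requires uniform bucket-level access.
gcloud storage buckets create gs://my-model-bucket-hns \
  --location=us-central1 \
  --uniform-bucket-level-access \
  --enable-hierarchical-namespace
```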

@@ -41,10 +42,10 @@ GCSFuse.
 - Ensure that your `MLP_ENVIRONMENT_FILE` is configured.

 ```sh
-set -a
+set -o allexport
 cat ${MLP_ENVIRONMENT_FILE} && \
 source ${MLP_ENVIRONMENT_FILE}
-set +a
+set +o allexport
 ```

 > You should see the various variables populated with the information specific
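The only change in this hunk is cosmetic: `set -o allexport` is the long, self-describing form of `set -a`, so every variable defined while it is in effect is exported automatically. A standalone illustration:

```sh
set -o allexport          # same effect as `set -a`
MY_VAR="visible to child processes"
set +o allexport          # same effect as `set +a`
sh -c 'echo "$MY_VAR"'    # prints: visible to child processes
```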
@@ -126,7 +127,7 @@ GCSFuse.

 ```
 NAME READY UP-TO-DATE AVAILABLE AGE
-vllm-openai-gcs-llama33-70b-a100 1/1 1 1 XXXXX
+vllm-openai-gcs-llama33-70b-a100 1/1 1 1 XXXXX
 ```

 ## Calculate pod startup time
@@ -140,7 +141,7 @@ GCSFuse.

 ENDING_MODEL_DOWNLOAD_TIME=`kubectl --namespace ${MLP_MODEL_SERVE_NAMESPACE} logs ${POD_NAME} -c "inference-server" | grep "^INFO.*Loading model weights took" | head -n 1 | awk '{print $2" "$3}' | xargs -I {} date -d "$(date +%Y)-{}" +%s%3N`

-MODEL_LOAD_TIME_WITHOUT_TUNING=$(( (ENDING_MODEL_DOWNLOAD_TIME - BEGIN_MODEL_DOWNLOAD_TIME)/1000 ))
+MODEL_LOAD_TIME_WITHOUT_TUNING=$(((ENDING_MODEL_DOWNLOAD_TIME - BEGIN_MODEL_DOWNLOAD_TIME)/1000 ))

 echo "MODEL LOAD TIME WITHOUT TUNING - ${MODEL_LOAD_TIME_WITHOUT_TUNING}s"
 ```
@@ -156,7 +157,7 @@ GCSFuse.

 POD_READY_TIME=`kubectl --namespace "${MLP_MODEL_SERVE_NAMESPACE}" get pods "$POD_NAME" -o json | jq -r '.status.conditions[] | select(.type == "Ready") | .lastTransitionTime' | date -f - +%s%3N`

-POD_STARTUP_TIME_WITHOUT_TUNING=$(( (POD_READY_TIME - POD_SCHEDULED_TIME)/1000 ))
+POD_STARTUP_TIME_WITHOUT_TUNING=$(((POD_READY_TIME - POD_SCHEDULED_TIME)/1000 ))

 echo "POD STARTUP TIME WITHOUT TUNING - ${POD_STARTUP_TIME_WITHOUT_TUNING}s"
 ```
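The long pipelines in these hunks all follow the same pattern; only the spacing inside the arithmetic expansion changes. Re-stated with comments for readability (same commands as in the guide, with the pod and namespace variables assumed to be set):

```sh
# 1. Read the vLLM server logs and keep the first "Loading model weights took" line.
# 2. awk prints the month-day and time columns of that log line.
# 3. date converts "<year>-<month-day> <time>" into epoch milliseconds (+%s%3N),
#    matching the format used for the other timestamps in the guide.
kubectl --namespace "${MLP_MODEL_SERVE_NAMESPACE}" logs "${POD_NAME}" -c inference-server \
  | grep "^INFO.*Loading model weights took" \
  | head -n 1 \
  | awk '{print $2" "$3}' \
  | xargs -I {} date -d "$(date +%Y)-{}" +%s%3N
```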
@@ -225,7 +226,7 @@ GCSFuse.

 ENDING_MODEL_DOWNLOAD_TIME_TUNED=`kubectl --namespace ${MLP_MODEL_SERVE_NAMESPACE} logs ${POD_NAME_TUNED} -c "inference-server" | grep "^INFO.*Loading model weights took" | head -n 1 | awk '{print $2" "$3}' | xargs -I {} date -d "$(date +%Y)-{}" +%s%3N`

-MODEL_LOAD_TIME_WITH_TUNING=$(( (ENDING_MODEL_DOWNLOAD_TIME_TUNED - BEGIN_MODEL_DOWNLOAD_TIME_TUNED)/1000 ))
+MODEL_LOAD_TIME_WITH_TUNING=$(((ENDING_MODEL_DOWNLOAD_TIME_TUNED - BEGIN_MODEL_DOWNLOAD_TIME_TUNED)/1000 ))

 echo "MODEL LOAD TIME WITH TUNING - ${MODEL_LOAD_TIME_WITH_TUNING}s"
 ```
@@ -241,7 +242,7 @@ GCSFuse.

 POD_READY_TIME_TUNED=`kubectl --namespace "${MLP_MODEL_SERVE_NAMESPACE}" get pods "$POD_NAME_TUNED" -o json | jq -r '.status.conditions[] | select(.type == "Ready") | .lastTransitionTime' | date -f - +%s%3N`

-POD_STARTUP_TIME_WITH_TUNING=$(( (POD_READY_TIME_TUNED - POD_SCHEDULED_TIME_TUNED)/1000 ))
+POD_STARTUP_TIME_WITH_TUNING=$(((POD_READY_TIME_TUNED - POD_SCHEDULED_TIME_TUNED)/1000 ))

 echo "POD STARTUP TIME WITH TUNING - ${POD_STARTUP_TIME_WITH_TUNING}s"
 ```
@@ -257,7 +258,7 @@ echo $POD_STARTUP_TIME_WITHOUT_TUNING
 echo $POD_STARTUP_TIME_WITH_TUNING
 ```

-The pod startup time without fine-tuning will be around 20 minutes and with
-fine-tuning, it will be around 3 mins. GCSFuse can facilitate faster download of
-the model weights and reduces the time to startup inference server. You will see
-good improvements with large models which have weights over several GBs.
+The pod startup time is approximately 20 minutes, but with tuning it can be
+reduced to around 3 minutes. GCSFuse can further decrease startup time by
+enabling faster downloads of model weights, which is especially beneficial for
+large models with weights exceeding several GBs.
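Once both runs have completed, the variables captured earlier in the guide can be compared directly. A small example using only names already defined above (both values are in seconds):

```sh
echo "Pod startup without tuning: ${POD_STARTUP_TIME_WITHOUT_TUNING}s"
echo "Pod startup with tuning:    ${POD_STARTUP_TIME_WITH_TUNING}s"
# Integer ratio; with ~20 minutes vs ~3 minutes this is roughly 6-7x.
echo "Speedup: $(( POD_STARTUP_TIME_WITHOUT_TUNING / POD_STARTUP_TIME_WITH_TUNING ))x"
```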
