# Inference Cost Optimization

Running inference for large language models (LLMs) can be expensive, and costs
can increase due to specific requirements. For example, reducing inference
startup latency may require advanced accelerators on high-end virtual machines
with extensive storage options. Choosing the right combination of accelerator,
virtual machine, and storage options for running large language models can be
complicated. The goal of this guide is to provide cost-efficient and
high-performance methods for running Llama model inference.

## Choosing an accelerator, machine, and storage

Google Cloud offers a variety of accelerators, including
[Graphics Processing Units (GPUs)](https://cloud.google.com/gpu) and
[Tensor Processing Units (TPUs)](https://cloud.google.com/tpu), as well as
[storage options](https://cloud.google.com/products/storage) such as Google
Cloud Storage (GCS), Parallelstore, Hyperdisk ML, and Persistent Disk. These
options cover the end-to-end requirements for running large language models
cost-effectively.

The number and type of accelerators needed for inference are typically
determined by the size of your model. For instance, running a
[Llama](https://www.llama.com/) 70B model, which has weights of roughly 132 GB,
requires a minimum of eight NVIDIA L4 GPUs, four NVIDIA A100 40GB GPUs, or two
NVIDIA A100 80GB GPUs. However, using additional accelerators, such as eight
NVIDIA A100 40GB GPUs instead of four, can result in faster model loading and
serving.
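
As a rough sizing aid, you can estimate the memory footprint of the weights
from the parameter count and precision, then compare it with the memory of
each GPU type. The following Python sketch illustrates that back-of-the-envelope
calculation; the GPU memory sizes are published specifications, while rounding
the GPU count up to a power of two reflects a common tensor-parallel sharding
convention and is an assumption rather than a hard requirement.

```python
import math

# Published memory per GPU, in GB.
GPU_MEMORY_GB = {
    "nvidia-l4": 24,
    "nvidia-tesla-a100": 40,  # A100 40GB
    "nvidia-a100-80gb": 80,
}


def min_gpus(weights_gb: float, gpu: str) -> int:
    """Estimate how many GPUs are needed just to hold the model weights.

    Serving stacks typically shard a model across a power-of-two number of
    GPUs (tensor parallelism), so the raw count is rounded up accordingly.
    KV cache and activation memory are ignored, so treat this as a floor.
    """
    raw = math.ceil(weights_gb / GPU_MEMORY_GB[gpu])
    return 2 ** math.ceil(math.log2(raw))


# Llama 70B weights in 16-bit precision are roughly 132 GB.
for gpu in GPU_MEMORY_GB:
    print(f"{gpu}: {min_gpus(132, gpu)} GPUs")
# nvidia-l4: 8, nvidia-tesla-a100: 4, nvidia-a100-80gb: 2
```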

The number of GPUs, amount of GPU memory, vCPUs, memory, and network bandwidth
all vary across virtual machine types. After deciding on the type and quantity
of accelerators required for inference, you can select the appropriate VM
family and machine type. For example, if you need eight NVIDIA L4 GPUs to serve
the Llama 70B model, the `g2-standard-96` machine type is the only one in the
G2 machine series that provides eight L4 GPUs. For an overview of the different
GPU VMs that are available on Compute Engine, see the
[GPU machine types](https://cloud.google.com/compute/docs/gpus) documentation.
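
Selecting the machine type then becomes a lookup against the shapes available
in the chosen machine series. The sketch below encodes L4 GPU counts for G2
machine types and returns the smallest machine that satisfies the requirement;
the counts listed here should be verified against the machine types
documentation linked above, since available shapes change.

```python
# L4 GPU count per G2 machine type; verify against the GPU machine types
# documentation before relying on these figures.
G2_L4_COUNT = {
    "g2-standard-4": 1,
    "g2-standard-8": 1,
    "g2-standard-12": 1,
    "g2-standard-16": 1,
    "g2-standard-24": 2,
    "g2-standard-32": 1,
    "g2-standard-48": 4,
    "g2-standard-96": 8,
}


def smallest_g2_for(gpus_needed: int) -> str:
    """Return the smallest G2 machine type that provides enough L4 GPUs."""
    candidates = [(count, name) for name, count in G2_L4_COUNT.items()
                  if count >= gpus_needed]
    if not candidates:
        raise ValueError(f"No single G2 machine type offers {gpus_needed} L4 GPUs")
    return min(candidates)[1]


print(smallest_g2_for(8))  # -> g2-standard-96
```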

The storage and accelerator that you choose are key factors that affect the
cost and performance of your inference. To run inference, the model must be
loaded from storage into GPU memory, so storage throughput affects the model
load time, the inference start-up time, and the time to first token (TTFT).
Google Cloud storage can be zonal, regional, or multi-regional. If you use
zonal storage and run inference workloads in three different zones, you need
three instances of the storage, one in each zone. Choosing the right storage
option is therefore critical for cost optimization.
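
Since a model load is essentially a large sequential read, a first-order
estimate of load time is the model size divided by the sustained read
throughput of the storage. The sketch below shows the calculation; the
throughput figures in the example are illustrative placeholders, not quoted
specifications, so substitute measured values for your storage configuration.

```python
def load_time_seconds(model_size_gb: float, read_throughput_gbps: float) -> float:
    """Rough model load time: size divided by sustained read throughput.

    Ignores protocol overhead and assumes the throughput is sustained for the
    whole transfer, so treat the result as a lower bound.
    """
    return model_size_gb / read_throughput_gbps


# Example: a 132 GB model (Llama 70B in 16-bit precision) loaded from storage
# sustaining 1 GB/s versus 5 GB/s. These rates are placeholders; substitute
# measured figures for your storage configuration.
for gbps in (1.0, 5.0):
    print(f"{gbps} GB/s -> ~{load_time_seconds(132, gbps):.0f} s")
```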

## Storage optimization

In the
[GCS storage optimization](/use-cases/inferencing/cost-optimization/gcsfuse/README.md)
guide, we demonstrate how you can tune GCS to achieve the best performance at a
lower cost.