Commit 9e89326

Model load with gcs fuse benchmarking tool (#863)
* Initial commit
* Benchmarking tool for gke with gcsfuse, data/model loading time
* Benchmarking tool for gke with gcsfuse, data/model loading time
  Signed-off-by: Kunjan Patel <[email protected]>
* Add more information to Readme
  Signed-off-by: Kunjan Patel <[email protected]>
* Add more information to Readme
  Signed-off-by: Kunjan Patel <[email protected]>
* Yaml format
  Signed-off-by: Kunjan Patel <[email protected]>
* format requirements.txt
  Signed-off-by: Kunjan Patel <[email protected]>
* Add check for pod node placement before begining profiling
  Signed-off-by: Kunjan Patel <[email protected]>
* Add check for pod node placement before begining profiling
  Signed-off-by: Kunjan Patel <[email protected]>
* Move under benchmarks/benchmark/tools
  Signed-off-by: Kunjan Patel <[email protected]>

---------
Signed-off-by: Kunjan Patel <[email protected]>
1 parent 7c1ab76 commit 9e89326

50 files changed (+8203, -0 lines changed)
@@ -0,0 +1,113 @@
# Benchmarker CLI
This tool helps users iterate over different GCSFuse configurations and benchmark the data loading time. See [the GCSFuse mount options available in GKE](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#mounting-flags) for details.

## Table of Contents
- [Architecture](#architecture)
- [Installation](#installation)
  - [From Source](#from-source)
  - [As a Go Package](#as-a-go-package)
- [Usage](#usage)
  - [Commands](#commands)
- [Examples](#examples)

## Architecture
The tool generates a variety of GCSFuse mount configurations from a configuration file. Each configuration parameter can specify a base value, a step size, and a max. These are used to generate all valid configurations (those where the requested resource is less than the resource limit). The tool also takes a pod specification and mutates it with the GCSFuse configuration by configuring and mounting the CSI volume. It then deploys the pod and measures the time until all containers in the pod are ready. This time, along with the configuration, is written to the results directory. A [matplotlib script](plot.py) can be run to generate scatter plots of loading time against a specific configuration parameter's value.
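
The base/step/max expansion described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not the tool's Go implementation: the function names, the `(base, step, max)` tuples, and the use of plain integers in one unit per parameter are all hypothetical, and the sketch treats a request equal to its limit as valid.

```python
from itertools import product


def expand(base, step, maximum):
    """Return base, base + step, ... for every value that does not exceed maximum."""
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    current = base
    while current <= maximum:
        values.append(current)
        current += step
    return values


def generate_cases(ranges, request_limit_pairs):
    """Yield each combination of parameter values whose requests fit their limits.

    ranges maps a parameter name to a (base, step, max) tuple.
    request_limit_pairs lists (request_name, limit_name) pairs to validate.
    """
    names = list(ranges)
    value_lists = [expand(*ranges[name]) for name in names]
    for combo in product(*value_lists):
        case = dict(zip(names, combo))
        # Assumption: a request equal to its limit still counts as valid.
        if all(case[req] <= case[lim] for req, lim in request_limit_pairs):
            yield case


# Hypothetical ranges, memory expressed in MiB so base and max share one unit.
ranges = {
    "memory-request": (1024, 2048, 3072),
    "memory-limit": (2048, 1, 2048),
    "max-parallel-downloads": (2, 2, 5),
}
cases = list(generate_cases(ranges, [("memory-request", "memory-limit")]))
for case in cases:
    print(case)
```

Here the 3072 MiB request is dropped because it exceeds the 2048 MiB limit, so only the two cases with a 1024 MiB request remain.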
## Installation

### From Source
1. **Clone the repository**:

2. **Build the CLI tool**:
   ```bash
   go build -o benchmarker
   ```

3. **Move the executable** (optional):
   ```bash
   mv benchmarker /usr/local/bin/
   ```
   This allows you to use the `benchmarker` command globally.

## Setup
```bash
gcloud container clusters get-credentials CLUSTER_NAME --location LOCATION
```
Replace `CLUSTER_NAME` and `LOCATION` with the values for your cluster.
Ensure cluster credentials are configured in kubeconfig with the gcloud credential helper.
The cluster must be able to scale up nodes or have existing nodes.

## Usage

The Benchmarker CLI provides commands to set configurations and run benchmarks.

### Commands

#### `config`
Manage configurations for benchmarks.

- **Usage**: `benchmarker config [subcommand]`
- **Subcommands**:
  - `set`: Set a configuration file for benchmarks.

#### `run`
Run the benchmark with the current configuration.

- **Usage**: `benchmarker run`
- **Description**: Executes the benchmark process based on the specified configuration file.

## Examples

### Have a pod spec for benchmarking
Create a pod spec whose data loading time you want to benchmark.
Configure readiness probes so that the pod becomes ready only once the expected data has been loaded through GCSFuse.
Also add any node selectors needed to ensure the benchmarking pods run on the preferred nodes.
[Example pod spec](example-pod.yaml)

### Set a Configuration File
To set a configuration file named `config.yaml`, use:
```bash
benchmarker config set -f config.yaml
```
See the [example config](base-config.yaml). Set limits higher than base values, and ensure the units are consistent between the base and max values. Cases are generated for both `false` and `true` values of Bool fields. When the file cache is not enabled, the other settings are not applied. Some cases may fail because of pod scheduling.

Required fields in the configuration:
- `basePodSpec`
- `volumeAttributes.bucketName`
- `volumeAttributes.mountOptions.only-dir`

See the available [SidecarResource](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#sidecar-container-resources) and [VolumeAttribute configuration fields](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#mounting-flags).
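
To make the link between these fields and the deployed pod more concrete, here is a rough Python sketch of how one generated case could be applied to a pod manifest held as a dictionary. It follows the GKE Cloud Storage FUSE CSI driver conventions (the `gke-gcsfuse/volumes` annotation, the `gke-gcsfuse/<resource>` sidecar annotations, and the `gcsfuse.csi.storage.gke.io` driver). The function name, volume name, and mount path are hypothetical, and the tool's own Go code may do this differently.

```python
import copy


def apply_gcsfuse_case(pod, bucket, mount_options, sidecar_resources,
                       volume_name="gcs-fuse-csi-ephemeral", mount_path="/data"):
    """Return a copy of a pod manifest with a GCSFuse CSI volume attached.

    volume_name and mount_path are hypothetical defaults chosen for this sketch.
    """
    mutated = copy.deepcopy(pod)

    # The annotation asks GKE to inject the GCSFuse sidecar container.
    annotations = mutated.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["gke-gcsfuse/volumes"] = "true"
    # Sidecar resources such as cpu-limit or memory-request are set via annotations.
    for key, value in sidecar_resources.items():
        annotations[f"gke-gcsfuse/{key}"] = str(value)

    volume = {
        "name": volume_name,
        "csi": {
            "driver": "gcsfuse.csi.storage.gke.io",
            "volumeAttributes": {
                "bucketName": bucket,
                # The CSI driver expects mount options as one comma-separated string.
                "mountOptions": ",".join(mount_options),
            },
        },
    }
    spec = mutated.setdefault("spec", {})
    # Replace any earlier volume with the same name so the function is repeatable.
    volumes = [v for v in spec.get("volumes", []) if v.get("name") != volume_name]
    spec["volumes"] = volumes + [volume]

    for container in spec.get("containers", []):
        mounts = [m for m in container.get("volumeMounts", [])
                  if m.get("name") != volume_name]
        mounts.append({"name": volume_name, "mountPath": mount_path, "readOnly": True})
        container["volumeMounts"] = mounts
    return mutated


base_pod = {"metadata": {"name": "model-loader"},
            "spec": {"containers": [{"name": "main", "image": "busybox"}]}}
mutated = apply_gcsfuse_case(
    base_pod,
    bucket="vertex-model-garden-public-us",
    mount_options=["implicit-dirs", "only-dir=codegemma/codegemma-2b"],
    sidecar_resources={"cpu-request": "200m", "memory-limit": "2Gi"},
)
print(mutated["spec"]["volumes"][0]["csi"]["volumeAttributes"]["mountOptions"])
```

Working on a deep copy keeps the base pod spec untouched, so the same base can be reused for every generated case.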

### Run a Benchmark
After setting the configuration, run the benchmark with:
```bash
benchmarker run
```
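
For each case, the measured quantity is the time from deployment until every container in the pod reports ready. The Python sketch below illustrates that measurement under stated assumptions: `get_status` is a hypothetical callable returning the pod's `status` as a dictionary (in the shape of the Kubernetes `containerStatuses` field), and the timeout and polling interval are arbitrary. It is not the tool's own implementation.

```python
import time


def all_containers_ready(pod_status):
    """True when the pod reports at least one container and all are ready."""
    statuses = pod_status.get("containerStatuses") or []
    return bool(statuses) and all(s.get("ready", False) for s in statuses)


def time_until_ready(get_status, timeout_s=600.0, poll_s=1.0):
    """Poll get_status() until all containers are ready; return elapsed seconds.

    get_status is a hypothetical callable that returns the pod's status dict.
    """
    start = time.monotonic()
    while True:
        if all_containers_ready(get_status()):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"pod not ready after {timeout_s} seconds")
        time.sleep(poll_s)


# Simulated status snapshots: not ready on the first poll, ready on the second.
snapshots = iter([
    {"containerStatuses": [{"name": "main", "ready": False}]},
    {"containerStatuses": [{"name": "main", "ready": True}]},
])
elapsed = time_until_ready(lambda: next(snapshots), poll_s=0.01)
print(f"ready after {elapsed:.3f}s")
```

Because readiness is what stops the clock, the readiness probe in the pod spec decides what "data loaded" means for the benchmark.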

## Plotting Results

The Benchmarker CLI includes a result visualization feature to help analyze benchmark performance across different configurations. This feature loads YAML result files, extracts key metrics, and generates scatter plots of elapsed time against various configuration parameters.

### Prerequisites
Install the required Python packages:
```bash
pip install -r requirements.txt
```

### Results directory
The YAML result files should be stored in a directory named `results`, with filenames following the format `case_<number>.yaml` (e.g., `case_1.yaml`, `case_2.yaml`).
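
Because the case number is embedded in the filename, a plain alphabetical sort would place `case_10.yaml` before `case_2.yaml`. The following standard-library sketch shows one way to collect result files in numeric case order. The helper name `case_files` is hypothetical and the code is not taken from the plotting script.

```python
import re
import tempfile
from pathlib import Path

# Matches the documented naming scheme, e.g. case_1.yaml or case_12.yaml.
CASE_FILE = re.compile(r"case_(\d+)\.yaml")


def case_files(results_dir):
    """Return (case_number, path) pairs for result files, ordered by case number."""
    found = []
    for path in Path(results_dir).iterdir():
        match = CASE_FILE.fullmatch(path.name)
        if match and path.is_file():
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda item: item[0])


# Demonstration on a throwaway directory that stands in for `results`.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("case_10.yaml", "case_2.yaml", "case_1.yaml", "notes.txt"):
        (Path(tmp) / name).write_text("")
    numbers = [number for number, _ in case_files(tmp)]
    print(numbers)
```

Files that do not match the naming scheme, such as generated PNG plots, are skipped.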

### Running the Plotting Script

1. **Generate YAML result files** by running your benchmarks and saving the results in the `results` directory.
2. **Run the plotting script** to generate scatter plots:
   ```bash
   python plot_results.py
   ```

This script will:
- **Generate Plots**: Scatter plots showing elapsed time versus each parameter, saved as PNG files in the `results` directory.

Each point on the scatter plots is labeled with its **case number**, and the corresponding configuration is saved in `case_<number>.yaml`.

## Example Plots
### Elapsed Time vs CPU Request
![Elapsed Time vs CPU Request](results/elapsed_time_vs_cpu_request.png)
### Elapsed Time vs Max Parallel Downloads
![Elapsed Time vs Max Parallel Downloads](results/elapsed_time_vs_max_parallel_downloads.png)
@@ -0,0 +1,63 @@
basePodSpec: "example-pod.yaml"
sideCarResources:
  cpu-limit:
    base: 20
    max: 20
    step: 5
  memory-limit:
    base: 2Gi
    max: 2Gi
    step: 20
  ephemeral-storage-limit:
    base: 50Gi
    max: 50Gi
    step: 20
  cpu-request:
    base: 200m
    max: 250m
    step: 50
  memory-request:
    base: 1Gi
    max: 3Gi
    step: 2
  ephemeral-storage-request:
    base: 40Gi
    max: 40Gi
    step: 10
volumeAttributes:
  bucketName: "vertex-model-garden-public-us"
  mountOptions:
    implicit-dirs: true
    only-dir: "codegemma/codegemma-2b"
    file-cache:
      enable-parallel-downloads: true
      parallel-downloads-per-file:
        base: 4
        step: 5
        max: 5
      max-parallel-downloads:
        base: 2
        step: 2
        max: 5
      download-chunk-size-mb:
        base: 3
        step: 3
        max: 6
  fileCacheCapacity:
    base: 10Gi
    step: 2
    max: 10Gi
  fileCacheForRangeRead: true
  metadataStatCacheCapacity:
    base: 500Mi
    step: 20
    max: 500Mi
  metadataTypeCacheCapacity:
    base: 500Mi
    step: 20
    max: 500Mi
  metadataCacheTTLSeconds:
    base: 600
    step: 20
    max: 620

@@ -0,0 +1,2 @@
[default]
MODEL_LOAD_BENCHMARK_CONFIG = base-config.yaml
