* add scripts to update region
* clean up python script
* pass variable gpu_type to nccl tests
* fix gpu_type variable in nccl tests
* update gpu_type variable call
* modify variable
* update root
* Add ephemeral storage
* avoid project_id here and read it as an input
* cleanup scripts
* Add back vpc changes #971
* Add machine type to differentiate the templates for different machine types
* Update UI based on approved description doc
* Remove host maintenance and other order changes
* Set consumption model variable as optional
* Fix error: "Invalid template interpolation value" and some cleanup
* remove default value
---------
Co-authored-by: Yevet <[email protected]>
```diff
@@ -41,3 +42,4 @@ spec:
+          subtext: Select from locations with available accelerators. If you have a reservation, select the zone where your reservation is located.
           xGoogleProperty:
             type: ET_GCE_ZONE
             gce_zone:
@@ -69,8 +71,9 @@ spec:
               type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
         a3_ultra_zone:
           name: a3_ultra_zone
-          title: Location for A3 Ultra
+          title: Cluster zone
           section: required_config
+          subtext: Select from locations with available accelerators. If you have a reservation, select the zone where your reservation is located.
           xGoogleProperty:
             type: ET_GCE_ZONE
             gce_zone:
@@ -84,72 +87,126 @@ spec:
               variableValues:
                 - A3 Ultra
               type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
-        node_count:
-          name: node_count
-          title: Node Count
-          section: required_config
-        recipe:
-          name: recipe
-          title: Deployment options
+        a3_mega_consumption_model:
+          name: a3_mega_consumption_model
+          title: Consumption options
           section: required_config
+          subtext: For optimal performance with distributed AI workloads, reserve densely allocated accelerator capacity. See <a href="https://cloud.google.com/ai-hypercomputer/docs/consumption-models"><i>Consumption options</i></a> for more details.</br>
           enumValueLabels:
-            - label: GKE Cluster Only
-              value: gke
-            - label: GKE Cluster with NCCL Tests
-              value: gke-nccl
-            - label: GKE Cluster with Llama-3.1-7B pretraining benchmark
-              value: llama3.1_7b_nemo_pretraining
-            - label: GKE Cluster with Llama-3.1-70B pretraining benchmark
-              value: llama3.1_7b_nemo_pretraining
-        consumption_model:
-          name: consumption_model
-          title: Consumption model
+            - label: Reservation
+              value: Reservation
+          toggleUsingVariables:
+            - variableName: gpu_type
+              variableValues:
+                - A3 Mega
+              type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
+        a3_ultra_consumption_model:
+          name: a3_ultra_consumption_model
+          title: Consumption options
           section: required_config
+          subtext: For optimal performance with distributed AI workloads, reserve densely allocated accelerator capacity. See <a href="https://cloud.google.com/ai-hypercomputer/docs/consumption-models"><i>Consumption options</i></a> for more details.</br>
           enumValueLabels:
             - label: Reservation
               value: Reservation
-            - label: On Demand
-              value: On Demand
+          toggleUsingVariables:
+            - variableName: gpu_type
+              variableValues:
+                - A3 Ultra
+              type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
         reservation:
           name: reservation
-          title: Reservation Name
+          title: Reservation name
           section: required_config
           toggleUsingVariables:
-            - variableName: consumption_model
+            - variableName: a3_mega_consumption_model
+              variableValues:
+                - Reservation
+              type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
+            - variableName: a3_ultra_consumption_model
               variableValues:
                 - Reservation
               type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
         reservation_block:
           name: reservation_block
-          title: Reservation Block
+          title: Reservation block
           section: required_config
           toggleUsingVariables:
-            - variableName: consumption_model
+            - variableName: a3_ultra_consumption_model
               variableValues:
                 - Reservation
               type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
         placement_policy_name:
           name: placement_policy_name
-          title: Placement Policy
+          title: Placement policy
           section: required_config
           toggleUsingVariables:
-            - variableName: consumption_model
+            - variableName: a3_mega_consumption_model
               variableValues:
                 - Reservation
               type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
-        host_maintenance:
-          name: host_maintenance
-          title: Host Maintainance
+        recipe:
+          name: recipe
+          title: Solution deployment option
           section: required_config
+          subtext: Select your deployment option.
           enumValueLabels:
-            - label: NONE
-              value: none
-            - label: PERIODIC
-              value: periodic
+            - label: GKE Cluster Only
+              value: gke
+            - label: GKE Cluster with NCCL Test
+              value: gke-nccl
+            - label: GKE Cluster with Llama-3.1-7B pretraining benchmark
+              value: llama3.1_7b_nemo_pretraining
+            - label: GKE Cluster with Llama-3.1-70B pretraining benchmark
+              value: llama3.1_70b_nemo_pretraining
+        node_count_gke:
+          name: node_count_gke
+          title: Node count
+          section: required_config
+          subtext: Please enter a value >= 0. If using a reservation, ensure that your reservation has the required capacity.
           toggleUsingVariables:
-            - variableName: consumption_model
+            - variableName: recipe
               variableValues:
-                - Reservation
+                - gke
+              type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
+        node_count_gke_nccl:
+          name: node_count_gke_nccl
+          title: Node count
+          section: required_config
+          subtext: Please enter a value >= 2. If using a reservation, ensure that your reservation has the required capacity.
+          toggleUsingVariables:
+            - variableName: recipe
+              variableValues:
+                - gke-nccl
+              type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
+        node_count_llama_3_7b:
+          name: node_count_llama_3_7b
+          title: Node count
+          section: required_config
+          subtext: Some benchmarks require a minimum number of nodes. If using a reservation, ensure that your reservation has the required capacity
+          enumValueLabels:
+            - label: 2
+              value: 2
+            - label: 4
+              value: 4
+          toggleUsingVariables:
+            - variableName: recipe
+              variableValues:
+                - llama3.1_7b_nemo_pretraining
+              type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
+        node_count_llama_3_70b:
+          name: node_count_llama_3_70b
+          title: Node count
+          section: required_config
+          subtext: Some benchmarks require a minimum number of nodes. If using a reservation, ensure that your reservation has the required capacity
+          enumValueLabels:
+            - label: 32
+              value: 32
+            - label: 40
+              value: 40
+          toggleUsingVariables:
+            - variableName: recipe
+              variableValues:
+                - llama3.1_70b_nemo_pretraining
               type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
         acknowledge:
           name: acknowledge
@@ -168,7 +225,7 @@ spec:
         - name: acknowledge
           title: Before you begin
           subtext: This solution deploys a sample <a href="https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute"><i>HyperCompute
-            Cluster</i></a> on GKE in your project.</br>
+            Cluster</i></a> with GKE in your project to run AI/ML and HPC workloads.</br>
```
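The conditional-display mechanism the diff leans on throughout is `toggleUsingVariables`: a variable is rendered in the deployment UI only while the referenced variable holds one of the listed values. A minimal sketch of the pattern, assembled from the added lines above (the nesting depth and surrounding `spec.ui.input.variables` structure are assumed from the visible diff context, not the full file):

```yaml
# Sketch only: `reservation` is displayed when either consumption-model
# variable (whichever is itself visible for the selected gpu_type) is
# set to "Reservation". Indentation relative to the parent keys is assumed.
reservation:
  name: reservation
  title: Reservation name
  section: required_config
  toggleUsingVariables:
    - variableName: a3_mega_consumption_model
      variableValues:
        - Reservation
      type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
    - variableName: a3_ultra_consumption_model
      variableValues:
        - Reservation
      type: DISPLAY_VARIABLE_TOGGLE_TYPE_UNSPECIFIED
```

The same pattern explains the four `node_count_*` variables: each toggles on a different `recipe` value, so exactly one node-count field is visible for any given deployment option.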