Merge qss-poc branch for HCC back to main #1059

Open
wants to merge 65 commits into
base: main
6c4ba4f
POC cluster-toolkit + launchpad
ACW101 Dec 6, 2024
2f61a5b
Update QSS UI
ACW101 Dec 13, 2024
aea10bc
Minor fixes to terraform templates
ACW101 Dec 16, 2024
4792578
Add specific reservation and node count as GKE nodepool input
ACW101 Jan 15, 2025
cae7675
Migrate legacy module kubectl-apply
ACW101 Jan 15, 2025
e91b8df
Add example workloads.tfvars
ACW101 Jan 16, 2025
5c716f2
Add conditional reservation affinity based on reservation and reserva…
ACW101 Jan 17, 2025
ee946f3
Some cleanup
ACW101 Jan 17, 2025
9f2cb4f
Support reservation with placement_policy and some misc fixes
ACW101 Jan 22, 2025
36c3ab0
Fix duplicate deployment_name variable
ACW101 Jan 23, 2025
18b6295
Add GPU type and cluster name
danjuan-81 Jan 24, 2025
41384ea
update the launchpad recipe to Deployment Options
danjuan-81 Jan 24, 2025
f97a088
Add recipe switching logic. (#950)
ACW101 Jan 25, 2025
0317fbb
Dynamic UI: 1) location=zone based on machine type = gpu type and 2) …
danjuan-81 Jan 27, 2025
edc6d5b
Add host maintenance for reservation option
danjuan-81 Jan 27, 2025
c224d8a
Replace Ray with HyperComputer Cluster
danjuan-81 Jan 27, 2025
d87b794
Add more zones and remove a4 zone selection
danjuan-81 Jan 27, 2025
43a7115
Fix typo for recipe which fails the required field validation check
danjuan-81 Jan 27, 2025
12b5dc2
Add NCCL test switching logic (#954)
Yevet Jan 28, 2025
3108700
Fix toggle type error
danjuan-81 Jan 28, 2025
a6182e0
Fix the toggle type for consumption model
danjuan-81 Jan 28, 2025
66b5660
Some small fixes and temporarily disable UI visibility toggle (#955)
ACW101 Jan 28, 2025
8da6d19
make all toggle properties as not required
danjuan-81 Jan 29, 2025
632d60a
Dev (#958)
ACW101 Jan 30, 2025
6a62658
merge upstream changes
danjuan-81 Jan 30, 2025
945fef8
Add missing GCS module (#959)
ACW101 Jan 31, 2025
5e3159c
delete checkpoint in UI
danjuan-81 Jan 31, 2025
6ff2569
Merge remote-tracking branch 'upstream/qss-poc' into qss-poc
danjuan-81 Jan 31, 2025
c292be9
Fix some words
danjuan-81 Jan 31, 2025
7c2a26d
Dev 2 (#960)
ACW101 Jan 31, 2025
63b4652
Convert A3 Mega NCCL test from pods to a jobset (#961)
ACW101 Feb 5, 2025
b1e34ab
A3U and A3M (#969)
danjuan-81 Feb 10, 2025
16fa586
Replace subnetwork name with vpc (#971)
danjuan-81 Feb 11, 2025
9bd8aed
Prefix GCS bucket name with project_id to avoid conflict (#975)
ACW101 Feb 11, 2025
1ee8d2b
Add Kueue (#978)
danjuan-81 Feb 13, 2025
cdb7f13
add ultra nccl tests (#966)
Yevet Feb 15, 2025
e390b7e
Qss poc (#990)
danjuan-81 Feb 20, 2025
858f3b5
Qss poc (#991)
danjuan-81 Feb 21, 2025
3f49db4
Some UI fixes and terraform fixes (#993)
ACW101 Feb 25, 2025
fb98776
Add readme for scripts to update regions and zones (#994)
Yevet Feb 25, 2025
88e9461
Some UI update and fix scheduling issue of NCCL test for A3U (#997)
ACW101 Feb 26, 2025
ea70100
Add kueue to A3Mega & enable TAS NCCL tests on kueue (#995)
Yevet Feb 26, 2025
8e14483
Fix cluster only validation error. (#1000)
ACW101 Feb 26, 2025
d25fdff
Add A3Ultra Llama3.1-7b and Llama3.1-70b recipe (#1004)
danjuan-81 Feb 28, 2025
ae17f4c
Upgrade A3Mega Nemo recipe with TAS using Kueue & enable NCCL tests o…
Yevet Mar 5, 2025
03b6d15
[A3Ultra] Add Llama3.1-70B MaxText and Mixtral8-7B Nemo recipes (#1014)
danjuan-81 Mar 6, 2025
786596a
Fix typo
danjuan-81 Mar 6, 2025
f2781c0
fix nemo on a3mega (#1021)
Yevet Mar 7, 2025
00e8366
Fix nemo (#1024)
Yevet Mar 7, 2025
f59b981
Update the recipe image and fix the parse error (#1030)
danjuan-81 Mar 11, 2025
84af8cf
add a3m placement not null validation (#1033)
Yevet Mar 12, 2025
93157cd
Use MaxText public image (#1038)
danjuan-81 Mar 13, 2025
a768bdf
Set the default value of node count variables to negative -1. (#1039)
danjuan-81 Mar 14, 2025
31ab18e
add basic validations to hcc (#1040)
Yevet Mar 15, 2025
9ce9444
fix permission bug for bucket access (#1037)
Yevet Mar 15, 2025
bb0f16e
Set default value for node_count_gke=0 and node_count_gke_nccl=2
danjuan-81 Mar 19, 2025
0a95732
add cloud build check for qss-poc branch (#1043)
Yevet Mar 20, 2025
e1f0b8e
Make reservation name as top properties, set the default value of con…
danjuan-81 Mar 20, 2025
3a0c0d6
Update region/zone list (#1056)
ACW101 Apr 3, 2025
c83bad4
Add subtext for reservation name and modify the subtext for consumpt…
danjuan-81 Apr 4, 2025
01f1ae3
Fix subtext for consumption model
danjuan-81 Apr 4, 2025
8d35e98
Merge branch 'main' into qss-poc
danjuan-81 Apr 4, 2025
0be27b3
fix cicd pipeline (#1060)
Yevet Apr 4, 2025
d4ee1e6
Automatically update regions and zones (#1057)
Yevet Apr 4, 2025
f8bb902
small fix on hcc qss (#1063)
Yevet Apr 10, 2025
44 changes: 44 additions & 0 deletions applications/hcc/README_update_zone.md
@@ -0,0 +1,44 @@
# GCP Zone and Region Availability Updater

This Python script (`update-region.py`) updates the zone and region data in a metadata YAML file and a JSON file based on the availability of specific Google Compute Engine (GCE) machine types in a given Google Cloud project. It provides two main functions:

1. **Updating Zones in YAML:** It updates the `allowlisted_zones` field in a YAML file (`metadata.display.yaml`) for specified machine types (e.g., "a3-megagpu-8g", "a3-ultragpu-8g").

2. **Generating Zone-to-Region Mapping:** It generates a JSON file (e.g., `zone_to_region.json`) that maps each zone to its region in your GCP project. This file is needed for the region lookup in [region.tf](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/qss-poc/applications/hcc/region.tf#L2).

The script uses the Google Cloud Compute Engine API to retrieve information about regions, zones, and machine type availability. It leverages the `google-cloud-compute` and `PyYAML` libraries for interacting with the API and processing YAML files.
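
As a rough illustration of the zone-to-region mapping, the region can also be derived from a zone name alone, because a Compute Engine zone name is its region name followed by a hyphen and a short suffix (for example, `europe-west1-b` is in `europe-west1`). The sketch below is not the code in `update-region.py`, which queries the API. `build_zone_to_region` is a hypothetical helper name, and the flat zone-to-region JSON layout is an assumption.

```python
import json


def build_zone_to_region(zones):
    """Map each zone name to its region by dropping the trailing zone suffix."""
    mapping = {}
    for zone in zones:
        # "europe-west1-b" -> ("europe-west1", "-", "b")
        region, sep, suffix = zone.rpartition("-")
        if not sep or not region or not suffix:
            raise ValueError(f"not a zone name: {zone!r}")
        mapping[zone] = region
    return mapping


def to_json(mapping):
    """Serialize the mapping with stable key order (assumed file layout)."""
    return json.dumps(mapping, indent=2, sort_keys=True)
```

Calling `to_json(build_zone_to_region(["australia-southeast1-c", "europe-west1-b"]))` yields a JSON object keyed by zone name.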

## Prerequisites

* **Python 3.6 or higher**
* **Google Cloud Project:** A project with the necessary APIs enabled (Compute Engine API).
* **Authentication:** Ensure your environment is authenticated to your Google Cloud Project. The easiest way is to use the `gcloud` command-line tool:
```bash
gcloud auth application-default login
```
* **Required Python packages:**
```bash
pip install google-cloud-compute PyYAML
```
## Usage
* Run the script with the `--project_id` argument:
```bash
python update-region.py --project_id your-project-id
```
Replace `your-project-id` with your actual Google Cloud Project ID.
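
For readers extending the script, a required `--project_id` flag of this kind is typically declared with `argparse`. This is a minimal sketch under that assumption, not the script's actual argument handling. The `parse_args` function name and the help text are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --project_id is mandatory."""
    parser = argparse.ArgumentParser(
        description="Update allowlisted zones and the zone-to-region mapping."
    )
    parser.add_argument(
        "--project_id",
        required=True,
        help="Google Cloud project ID to query for machine type availability.",
    )
    # argv=None makes argparse read sys.argv[1:], as a CLI entry point would.
    return parser.parse_args(argv)
```

Passing an explicit list, such as `parse_args(["--project_id", "my-gcp-project"])`, makes the parsing easy to exercise without a shell.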


### Example
To update the `metadata.display.yaml` file with the available zones for the `a3-megagpu-8g` and `a3-ultragpu-8g` machine types in the `my-gcp-project` project, and generate the `zone_to_region.json` file:
```bash
python update-region.py --project_id my-gcp-project
```
This will:

* Create or update the `zone_to_region.json` file with the zone-to-region mapping for your project.
* Update the `allowlisted_zones` in `metadata.display.yaml` for both `a3-megagpu-8g` (under `a3_mega_zone`) and `a3-ultragpu-8g` (under `a3_ultra_zone`) with the zones where these machine types are currently available.


## Notes
* The script uses the application default credentials for authentication. Make sure you've authenticated your environment using `gcloud auth application-default login`.
* The script currently supports updating `allowlisted_zones` for the `a3-megagpu-8g` and `a3-ultragpu-8g` machine types only. To support other machine types, modify the `update_blueprint_metadata` function.
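
One way to make such an extension data-driven is a lookup table from machine type to the variable that holds its zones. The sketch below only illustrates the idea and does not reproduce `update_blueprint_metadata`. The two table entries come from the example above, while `MACHINE_TYPE_TO_VARIABLE`, `set_allowlisted_zones`, and the nested dictionary layout are hypothetical stand-ins for the real structure of `metadata.display.yaml`.

```python
# Hypothetical table: machine type -> variable whose zone list is updated.
MACHINE_TYPE_TO_VARIABLE = {
    "a3-megagpu-8g": "a3_mega_zone",
    "a3-ultragpu-8g": "a3_ultra_zone",
}


def set_allowlisted_zones(variables, machine_type, zones):
    """Store a sorted, de-duplicated zone list under the variable for machine_type.

    `variables` is an assumed dict of the form
    {"a3_mega_zone": {"allowlisted_zones": [...]}, ...}.
    """
    if machine_type not in MACHINE_TYPE_TO_VARIABLE:
        raise KeyError(f"unsupported machine type: {machine_type}")
    name = MACHINE_TYPE_TO_VARIABLE[machine_type]
    entry = variables.setdefault(name, {})
    entry["allowlisted_zones"] = sorted(set(zones))
    return variables
```

With this shape, supporting another machine type would mean adding one entry to the table rather than another branch in the update logic.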
45 changes: 45 additions & 0 deletions applications/hcc/a3mega_workloads.tfvars
@@ -0,0 +1,45 @@
/**
* Copyright 2023 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

authorized_cidr = "0.0.0.0/0"

goog_cm_deployment_name = "a3mega-qss-test"

labels = {
ghpc_blueprint = "gke-a3-mega"
ghpc_deployment = "a3mega-qss-test"
}

project_id = "gke-ai-eco-dev"

a3_mega_zone = "australia-southeast1-c"
a3_ultra_zone = ""

gpu_type = "A3 Mega"

reservation = "hcc-a3mega"
reservation_block = ""
placement_policy_name = "kevinmcw-test"

a3mega_recipe = "gke"
a3ultra_recipe = ""
node_count_gke_nccl = -1
node_count_gke = 0
node_count_nemo = -1
node_count_maxtext = -1
node_count_llama_3_7b = -1
a3_mega_consumption_model = "Reservation"
a3_ultra_consumption_model = ""
54 changes: 54 additions & 0 deletions applications/hcc/a3ultra_workloads.tfvars
@@ -0,0 +1,54 @@
/**
* Copyright 2023 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

authorized_cidr = "0.0.0.0/0"

goog_cm_deployment_name = "ultra-cluster-test"

labels = {
ghpc_blueprint = "gke-a3-ultra"
ghpc_deployment = "a3ultra-hcc-test"
}

project_id = "gke-ai-eco-dev"

a3_mega_zone = ""
a3_ultra_zone = "europe-west1-b"

node_count_gke_nccl = -1
node_count_gke = 0
node_count_nemo = -1
node_count_maxtext = -1
node_count_llama_3_7b = -1

# A3 Ultra recipe options:
# - "gke"
# - "gke-nccl"
# - "llama3.1_7b_nemo_pretraining"
# - "llama3.1_70b_nemo_pretraining"
# - "llama3.1_70b_maxtext_pretraining"
# - "mixtral8_7b_nemo_pretraining"
# - "mixtral8_7b_maxtext_pretraining"
a3ultra_recipe = "gke"
a3mega_recipe = ""

reservation = ""
reservation_block = ""
placement_policy_name = ""

gpu_type = "A3 Ultra"
a3_ultra_consumption_model = "Reservation"
a3_mega_consumption_model = ""