Merge qss-poc branch for HCC back to main #1059

Open
wants to merge 64 commits into
base: main

Conversation

danjuan-81 (Collaborator) commented Apr 4, 2025

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking
/kind bug
/kind cleanup
/kind documentation
/kind enhancement
/kind new quick start solution for HyperCompute Cluster

What this PR does / Why we need it:
Adds a new folder, hcc, under applications, and updates cloudbuild.xml with new test cases for the new quick start solution (qss).

Which issue(s) this PR fixes:

Closes #

Special notes for your reviewer:

ACW101 and others added 30 commits December 6, 2024 23:51
…consumption model. Rename reservation as reservation_name
* Remove checkpoint bucket input

* Fix llama7b config and incorrect recipe enum values

* Remove region input and add logic to look up region by zone
* Add missing GCS module

* Small fixes.
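
One of the commits above drops the region input and derives the region from the zone instead. The PR page does not show how that lookup is implemented, so the following is only a minimal sketch of the idea, assuming the standard Compute Engine naming scheme in which a zone is its region plus a short suffix (for example `us-central1-a`). The function name `region_from_zone` is hypothetical and does not come from this PR.

```python
def region_from_zone(zone: str) -> str:
    """Return the region part of a zone name such as 'us-central1-a'.

    Assumption: the zone follows the '<region>-<suffix>' convention.
    This helper is illustrative and is not taken from the PR.
    """
    # Split on the last hyphen: everything before it is the region.
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region
```

Deriving the region this way leaves the zone as the only location input, so the two values cannot disagree.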
danjuan-81 and others added 25 commits February 20, 2025 10:17
* add scripts to update region

* clean up python script

* parse variable gpu_type to nccl tests

* fix gpu_type variable in nccl tests

* update gpu_type variable call

* modify variable

* update root

* Add ephemeral storage

* void project_id here and read it as an input

* cleanup scripts

* Add back vpc changes #971

---------

Co-authored-by: Yevet <[email protected]>
* add scripts to update region

* clean up python script

* parse variable gpu_type to nccl tests

* fix gpu_type variable in nccl tests

* update gpu_type variable call

* modify variable

* update root

* Add ephemeral storage

* void project_id here and read it as an input

* cleanup scripts

* Add back vpc changes #971

* Add machine type to differentiate the templates for different machine types

* Update UI based on approved description doc

* Remove host maintenance and other order changes

* Set consumption model variable as optional

* Fix error: "Invalid template interpolation value" and some cleanup

* remove default value

---------

Co-authored-by: Yevet <[email protected]>
* Fix reservation toggle for A3U

* Modify output based on UI requirement

* Enable HOST_MAINTENANCE=PERIODIC for A3 Mega

* Fix nccl test workload delete failure due to dependency with cluster module
add readme for scripts to update zones and regions
* Update UI to temporarily remove GPU recipes and update 'Suggested next steps' section

* Remove schedulingGate from A3U NCCL test
* add kueue for nccl tests

* clean up code for kueue

* cleanup ultra test

* solve comments

* rename kueue
* Make reservation name variable required

* Add a3ultra Llama3.1-7b recipe
Update the network to vpc1...9

* Add a3ultra Llama3.1-70b recipe

* Remove unused cluster name from Nemo module.
…n >2 nodes (#1007)

* add kueue to nemo

* enable nccl tests on >2 nodes

* fix nemo with kueue

* cleanup

* resolve conflict

* modify reservation for a3u
* Add Llama3.1 70B using MaxText

* Add mixtral8 70b NeMo recipe

* Add A3Mega Mixtral8-7b NeMo recipe and A3Ultra Mixtral8-7b maxtext recipe

* Use the maxtext docker image provided by the GPU recipe team

* Update MaxText recipes using 16 nodes
* fix nemo

* fix
* Update image

* Fix Mixtral model parsing error.
Use new image for MaxText
* add a3m placement not null validation

* fix permission problem

* fix api issue

* future verify: api enable
* initial cloud build for hcc

* change project for cloudbuild

* disable mega cluster

* cleanup

* cleanup

* modify deploy name

* finalize cloudbuild

* remove reservation for gke only

* change cloudbuild name

* cleanup cloudbuild
#1044)

Make reservation name a top property and set the default value of the consumption option to reservation
#1058)

Add subtext for reservation name and modify the subtext for consumption options.
danjuan-81 and others added 3 commits April 4, 2025 10:55
* fix cicd pipeline

* fix cicd pipeline
* add cloudbuild scripts to update regions and zones

* add cloudbuild scripts to update regions and zones

* fix path for python script

* modify readme
3 participants