Merge qss-poc branch for HCC back to main #1059
Open
danjuan-81 wants to merge 64 commits into main from qss-poc
…tion blocks inputs
…consumption model. Rename reservation as reservation_name
* add scripts to update region * clean up python script * parse variable gpu_type to nccl tests * fix gpu_type variable in nccl tests * update gpu_type variable call * modify variable * update root * Add ephemeral storage * void project_id here and read it as an input * cleanup scripts * Add back vpc changes #971 --------- Co-authored-by: Yevet <[email protected]>
* add scripts to update region * clean up python script * parse variable gpu_type to nccl tests * fix gpu_type variable in nccl tests * update gpu_type variable call * modify variable * update root * Add ephemeral storage * void project_id here and read it as an input * cleanup scripts * Add back vpc changes #971 * Add machine type to differentiate the templates for different machine types * Update UI based on approved description doc * Remove host maintenance and other order changes * Set consumption model variable as optional * Fix error: "Invalid template interpolation value" and some cleanup * remove default value --------- Co-authored-by: Yevet <[email protected]>
* Fix reservation toggle for A3U * Modify output based on UI requirement * Enable HOST_MAINTENANCE=PERIODIC for A3 Mega * Fix nccl test workload delete failure due to dependency with cluster module
add readme for scripts to update zones and regions
* Update UI to temporarily remove GPU recipes and update 'Suggested next steps' section * Remove schedulingGate from A3U NCCL test
* add kueue for nccl tests * clean up codes for kueue * cleanup ultra test * solve comments * rename kueue
* Make reservation name variable required * Add a3ultra Llama3.1-7b recipe Update the network to vpc1...9 * Add a3ultra Llama3.1-70b recipe * Remove unused cluster name from Nemo module.
…n >2 nodes (#1007) * add kueue to nemo * enable nccl tests on >2 nodes * fix nemo with kueue * cleanup * resolve conflict * modify reservation for a3u
* Add Llama3.1 70B using MaxText * Add mixtral8 70b NeMo recipe * Add A3Mega Mixtral8-7b NeMo recipe and A3Ultra Mixtral8-7b maxtext recipe * Use the maxtext docker image provided by the GPU recipe team * Update MaxText recipes using 16 nodes
fix nemo
* fix nemo * fix
* Update image * Fix Mixtral model parsing error.
Use new image for MaxText
basic validation test
* add a3m placement not null validation * fix permission problem * fix api issue * future verify: api enable
* initial cloud build for hcc * change project for cloudbuild * disable mega cluster * cleanup * cleanup * modify deploy name * finalize cloudbuild * remove reservation for gke only * change cloudbuild name * cleanup cloudbuild
#1044) Make reservation name as top properties, set the default value of consumption option to reservation
#1058) Add subtext for reservation name and modify the subtext for consumption options.
...munity/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/util.py
* fix cicd pipeline * fix cicd pipeline
* add cloudbuild scripts to update regions and zones * add cloudbuild scripts to update regions and zones * fix path for python script * modify readme
What type of PR is this?
What this PR does / Why we need it:
Adds a new hcc folder under applications and updates cloudbuild.xml with new test cases for the new qss.
Which issue(s) this PR fixes:
Closes #
Special notes for your reviewer: