Commit 4f356d4

slurm queues: disable SMT on intel nodes and add m7a and m7i

* Disable SMT on c7i nodes by default
* Update _slurm_queues.tpl.yaml
  Add another queue with memory-optimized nodes (m7a and m7i).
* Update _slurm_queues.tpl.yaml
  Add another queue with memory-optimized nodes (m7a and m7i) in config-dev.

1 parent ea90ec6 commit 4f356d4

2 files changed: +72 -0 lines changed

hpc_provisioner/src/hpc_provisioner/config-dev/_slurm_queues.tpl.yaml

Lines changed: 36 additions & 0 deletions
@@ -116,6 +116,42 @@
         - InstanceType: c7i.48xlarge # compute optimized nodes
       MinCount: 0
       MaxCount: 20 # least number of nodes needed to simulate the full O1 circuit x2
+      DisableSimultaneousMultithreading: true
+      Efa: # low-latency, high BW network
+        Enabled: true # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security
+  Networking:
+    PlacementGroup: # try to place nodes close to each other
+      Enabled: true
+    SubnetIds: [!config base_subnet_id]
+    SecurityGroups:
+      - !config base_security_group_id
+      - !config efa_security_group_id # Efa
+  CustomSlurmSettings:
+    MaxNodes: 20
+    MaxTime: 720
+  Iam:
+    S3Access:
+      - BucketName: sboinfrastructureassets-sandbox
+
+# ==============================================================
+# prod-mpi-mem queue, for tightly-coupled, memory-intensive jobs
+# ==============================================================
+- Name: prod-mpi-mem
+  AllocationStrategy: lowest-price # usually on-demand
+  ComputeResources:
+    - Name: cpu-m7a
+      Instances:
+        - InstanceType: m7a.48xlarge # memory-optimized nodes, AMD arch
+      MinCount: 0
+      MaxCount: 20 # least number of nodes needed to simulate the full O1 circuit x2
+      Efa: # low-latency, high BW network
+        Enabled: true # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security
+    - Name: cpu-m7i
+      Instances:
+        - InstanceType: m7i.48xlarge # memory-optimized nodes, Intel arch
+      MinCount: 0
+      MaxCount: 20 # least number of nodes needed to simulate the full O1 circuit x2
+      DisableSimultaneousMultithreading: true
       Efa: # low-latency, high BW network
         Enabled: true # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security
   Networking:

hpc_provisioner/src/hpc_provisioner/config/_slurm_queues.tpl.yaml

Lines changed: 36 additions & 0 deletions
@@ -116,6 +116,42 @@
         - InstanceType: c7i.48xlarge # compute optimized nodes
       MinCount: 0
       MaxCount: 20 # least number of nodes needed to simulate the full O1 circuit x2
+      DisableSimultaneousMultithreading: true
+      Efa: # low-latency, high BW network
+        Enabled: true # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security
+  Networking:
+    PlacementGroup: # try to place nodes close to each other
+      Enabled: true
+    SubnetIds: [!config base_subnet_id]
+    SecurityGroups:
+      - !config base_security_group_id
+      - !config efa_security_group_id # Efa
+  CustomSlurmSettings:
+    MaxNodes: 20
+    MaxTime: 720
+  Iam:
+    S3Access:
+      - BucketName: sboinfrastructureassets-sandbox
+
+# ==============================================================
+# prod-mpi-mem queue, for tightly-coupled, memory-intensive jobs
+# ==============================================================
+- Name: prod-mpi-mem
+  AllocationStrategy: lowest-price # usually on-demand
+  ComputeResources:
+    - Name: cpu-m7a
+      Instances:
+        - InstanceType: m7a.48xlarge # memory-optimized nodes, AMD arch
+      MinCount: 0
+      MaxCount: 20 # least number of nodes needed to simulate the full O1 circuit x2
+      Efa: # low-latency, high BW network
+        Enabled: true # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security
+    - Name: cpu-m7i
+      Instances:
+        - InstanceType: m7i.48xlarge # memory-optimized nodes, Intel arch
+      MinCount: 0
+      MaxCount: 20 # least number of nodes needed to simulate the full O1 circuit x2
+      DisableSimultaneousMultithreading: true
       Efa: # low-latency, high BW network
         Enabled: true # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security
   Networking:
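
Both files apply the same rule: the Intel compute resources (c7i, m7i) set DisableSimultaneousMultithreading, while the AMD one (m7a) does not. Because the rule has to be kept in sync across config and config-dev, a small check could help catch drift. The following is a minimal sketch, not part of this commit. It assumes the queue list has already been parsed into plain dicts, and the helper names (`is_intel`, `missing_smt_flag`) and the family-name heuristic are hypothetical.

```python
import re

# Assumption: an Intel family has an "i" right after the generation digit
# (c7i, m7i). This heuristic is illustrative only, not an AWS naming guarantee.
INTEL_FAMILY = re.compile(r"^[a-z]+\d+i[a-z-]*$")


def is_intel(instance_type: str) -> bool:
    """Guess whether an instance type such as 'm7i.48xlarge' is Intel-based."""
    family = instance_type.split(".", 1)[0]
    return bool(INTEL_FAMILY.match(family))


def missing_smt_flag(queues):
    """Return (queue name, compute resource name) pairs for Intel compute
    resources that do not set DisableSimultaneousMultithreading to true."""
    missing = []
    for queue in queues:
        for resource in queue.get("ComputeResources", []):
            types = [inst["InstanceType"] for inst in resource.get("Instances", [])]
            flag = resource.get("DisableSimultaneousMultithreading")
            if any(is_intel(t) for t in types) and flag is not True:
                missing.append((queue["Name"], resource["Name"]))
    return missing


# The prod-mpi-mem queue from this commit, reduced to the relevant keys.
PROD_MPI_MEM = {
    "Name": "prod-mpi-mem",
    "ComputeResources": [
        {
            "Name": "cpu-m7a",
            "Instances": [{"InstanceType": "m7a.48xlarge"}],
        },
        {
            "Name": "cpu-m7i",
            "Instances": [{"InstanceType": "m7i.48xlarge"}],
            "DisableSimultaneousMultithreading": True,
        },
    ],
}

if __name__ == "__main__":
    print(missing_smt_flag([PROD_MPI_MEM]))
```

Run against each rendered queue file, an empty result would mean every Intel compute resource carries the flag. The templates use a custom `!config` tag, so loading them needs whatever renderer or loader the provisioner already uses; that step is left out here.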