Commit 6d51e75

Add documentation
1 parent 8d42030 commit 6d51e75

6 files changed, +119 -37 lines changed

@@ -0,0 +1,48 @@
+# Machine Learning Infrastructure Monitoring for Slurm-based clusters <!-- omit from toc -->
+
+This solution provides a "1-click" deployment of an observability stack to monitor your Slurm-based machine learning infrastructure. It automatically:
+- Creates an Amazon Managed Grafana workspace
+- Creates an Amazon Managed Service for Prometheus workspace
+- Sets up a Prometheus agent collector
+- Creates the data source and dashboards in Grafana
+
+
+## Prerequisites
+
+Install the AWS Serverless Application Model Command Line Interface (AWS SAM CLI), version **>=1.135.0**, by following the [instructions](<https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html>).
+
+
+## Architecture
+TBD
+
+## Deploy
+Begin by installing the Python packages required by the Lambda function.
+In your shell:
+
+```bash
+cd dashboards
+pip install -r requirements.txt -t .
+cd ..
+```
+
+You are now ready to deploy the serverless application. Run the following in your shell:
+
+```bash
+OBS_DASHBOARD_NAME="ml-obs-dashboard"
+sam build
+sam deploy --stack-name ${OBS_DASHBOARD_NAME} \
+  --guided \
+  --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
+  --parameter-overrides \
+    ParameterKey=PCClusterName,ParameterValue=<CLUSTER_NAME> \
+    ParameterKey=SubnetId,ParameterValue=<SUBNET_ID> \
+    ParameterKey=VpcId,ParameterValue=<VPC_ID>
+```
+
+
+## Clean up
+To delete the SAM application deployment, run the following in your terminal:
+
+```bash
+sam delete
+```
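The `--parameter-overrides` pairs in the deploy command above can also be generated from a mapping, which helps when the values come from a script. This is a minimal Python sketch, not part of the commit: the helper name `format_parameter_overrides` and the sample values are hypothetical placeholders, and only the three parameter names come from the README.

```python
def format_parameter_overrides(params):
    """Render a dict as 'ParameterKey=...,ParameterValue=...' tokens."""
    return [
        f"ParameterKey={key},ParameterValue={value}"
        for key, value in params.items()
    ]


# Placeholder values (assumptions); substitute your own cluster, subnet, and VPC.
overrides = format_parameter_overrides({
    "PCClusterName": "my-cluster",
    "SubnetId": "subnet-0123456789abcdef0",
    "VpcId": "vpc-0123456789abcdef0",
})

# Join with spaces to paste after --parameter-overrides.
print(" ".join(overrides))
```

Each token corresponds to one `ParameterKey=...,ParameterValue=...` line of the `sam deploy` command.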
@@ -0,0 +1,71 @@
+AWSTemplateFormatVersion: '2010-09-09'
+Description: "Set up monitoring for SageMaker HyperPod clusters on AWS. Amazon Managed Service for Prometheus and Amazon Managed Grafana workspaces with associated IAM roles are deployed in the AWS account. Prometheus and exporter services are set up on cluster nodes. Author: Matt Nightingale - nghtm@"
+
+
+Resources:
+  AmazonGrafanaWorkspaceIAMRole:
+    Type: 'AWS::IAM::Role'
+    Properties:
+      AssumeRolePolicyDocument:
+        Version: '2012-10-17'
+        Statement:
+          - Effect: Allow
+            Principal:
+              Service:
+                - grafana.amazonaws.com
+            Action:
+              - 'sts:AssumeRole'
+      RoleName: !Sub ${AWS::StackName}-Grafana-Role
+
+  AmazonGrafanaPrometheusPolicy:
+    Type: AWS::IAM::Policy
+    Properties:
+      PolicyName: AmazonGrafana_Prometheus_policy
+      PolicyDocument:
+        Version: '2012-10-17'
+        Statement:
+          - Effect: Allow
+            Action:
+              - aps:ListWorkspaces
+              - aps:DescribeWorkspace
+              - aps:QueryMetrics
+              - aps:GetLabels
+              - aps:GetSeries
+              - aps:GetMetricMetadata
+            Resource: "*"
+      Roles: [!Ref AmazonGrafanaWorkspaceIAMRole]
+
+  AmazonGrafanaWorkspace:
+    Type: 'AWS::Grafana::Workspace'
+    Properties:
+      AccountAccessType: CURRENT_ACCOUNT
+      Name: !Sub ${AWS::StackName}-Dashboard
+      Description: Amazon Grafana Workspace to monitor SageMaker Cluster
+      AuthenticationProviders:
+        - AWS_SSO
+      PermissionType: SERVICE_MANAGED
+      RoleArn: !GetAtt
+        - AmazonGrafanaWorkspaceIAMRole
+        - Arn
+      DataSources: ["CLOUDWATCH","PROMETHEUS"]
+      OrganizationRoleName: "ADMIN"
+
+  APSWorkspace:
+    Type: AWS::APS::Workspace
+    Properties:
+      Alias: !Sub ${AWS::StackName}-Hyperpod-WorkSpace
+      Tags:
+        - Key: Name
+          Value: SageMaker Hyperpod PrometheusMetrics
+
+Outputs:
+  Region:
+    Value: !Ref "AWS::Region"
+  AMPRemoteWriteURL:
+    Value: !Join ["" , [ !GetAtt APSWorkspace.PrometheusEndpoint , "api/v1/remote_write" ]]
+  AMPEndPointUrl:
+    Value: !GetAtt APSWorkspace.PrometheusEndpoint
+  GrafanWorkspaceURL:
+    Value: !Join ["" , [ "https://", !GetAtt AmazonGrafanaWorkspace.Endpoint ]]
+  GrafanWorkspaceId:
+    Value: !GetAtt AmazonGrafanaWorkspace.Id
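
The `AMPRemoteWriteURL` output above joins the workspace endpoint and `api/v1/remote_write` with no separator, so it relies on `PrometheusEndpoint` ending in a slash. A consumer of these outputs could build the same URL defensively. The following Python sketch illustrates that idea; the function name `remote_write_url` and the example endpoint (with the placeholder workspace ID `ws-EXAMPLE`) are assumptions, not part of the template.

```python
def remote_write_url(prometheus_endpoint: str) -> str:
    """Append the remote-write path, tolerating a missing trailing slash."""
    return prometheus_endpoint.rstrip("/") + "/api/v1/remote_write"


# Hypothetical endpoint; the real value comes from the AMPEndPointUrl output.
endpoint = "https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-EXAMPLE/"

print(remote_write_url(endpoint))
# The result is the same whether or not the endpoint ends with "/".
print(remote_write_url(endpoint.rstrip("/")))
```

Normalizing the slash in one place avoids producing `...ws-EXAMPLEapi/v1/remote_write` or a doubled slash if the endpoint format ever differs.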

4.validation_and_observability/4.prometheus-grafana/dashboards/create_ml_dashboards.py → 4.validation_and_observability/4.prometheus-grafana/1click-dashboards-deployment/dashboards/create_ml_dashboards.py

-37
@@ -119,40 +119,3 @@ def lambda_handler(event, context):
                                               workspaceId=workspace_id)
 
     return {'statusCode': 200, 'body': json.dumps('Dashboards created')}
-
-
-def main():
-    aws_region = 'ap-southeast-2'
-    grafana_key_name = "CreateDashpa01"
-    grafana_url = 'https://g-182a00efff.grafana-workspace.ap-southeast-2.amazonaws.com'
-    workspace_id = 'g-182a00efff'
-    prometheus_url = 'https://aps-workspaces.ap-southeast-2.amazonaws.com/workspaces/ws-e4384558-d586-46ec-bd12-173d7019119e/'
-
-    client = boto3.client('grafana')
-    response = client.create_workspace_api_key(keyName=grafana_key_name,
-                                               keyRole='ADMIN',
-                                               secondsToLive=60,
-                                               workspaceId=workspace_id)
-
-    try:
-        grafana = GrafanaApi.from_url(
-            url=grafana_url,
-            credential=TokenAuth(token=response['key']),
-        )
-
-        prometheus_datasource = create_prometheus_datasource(
-            grafana, prometheus_url, aws_region)
-
-        for i in PROM_DASHBOARDS_URL:
-            dashboard_payload = mk_dash(prometheus_datasource['uid'], i)
-            response = grafana.dashboard.update_dashboard(dashboard_payload)
-
-    except Exception as e:
-        print(e)
-
-    response = client.delete_workspace_api_key(keyName=grafana_key_name,
-                                               workspaceId=workspace_id)
-
-
-if __name__ == '__main__':
-    main()
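
The removed `main()` follows a create, use, delete lifecycle for a short-lived Grafana API key. The same lifecycle could be expressed as a context manager so the key is deleted on every exit path. The sketch below is illustrative only and is not the author's implementation: `temporary_api_key` and `FakeGrafanaClient` are hypothetical names, and the fake client stands in for `boto3.client('grafana')` so the example runs offline. The two client methods and their keyword arguments are the ones used in the diff above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_api_key(client, workspace_id, key_name, ttl_seconds=60):
    """Create a workspace API key, yield its token, and always delete it."""
    response = client.create_workspace_api_key(
        keyName=key_name,
        keyRole="ADMIN",
        secondsToLive=ttl_seconds,
        workspaceId=workspace_id,
    )
    try:
        yield response["key"]
    finally:
        # Runs on normal exit and when the body raises.
        client.delete_workspace_api_key(keyName=key_name,
                                        workspaceId=workspace_id)


class FakeGrafanaClient:
    """Offline stand-in (assumption) that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def create_workspace_api_key(self, **kwargs):
        self.calls.append(("create", kwargs["keyName"]))
        return {"key": "dummy-token"}

    def delete_workspace_api_key(self, **kwargs):
        self.calls.append(("delete", kwargs["keyName"]))
        return {}


client = FakeGrafanaClient()
with temporary_api_key(client, "g-example", "demo-key") as token:
    print(token)  # use the token to build the Grafana API client here
print(client.calls)
```

With a real client, the body of the `with` block would hold the data source and dashboard creation steps.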
