Commit 10ee074

Add 1click grafana dashboard deployment
1 parent 4039347 commit 10ee074

7 files changed: +747, -0 lines changed

@@ -0,0 +1,48 @@
# Machine Learning Infrastructure Monitoring for Slurm-based Clusters <!-- omit from toc -->

This solution provides a "1-click" deployment of an observability stack to monitor your Slurm-based machine learning infrastructure. It automatically creates:

- An Amazon Managed Grafana workspace
- An Amazon Managed Service for Prometheus workspace
- A Prometheus agent collector
- A Prometheus data source and dashboards in Grafana

## Prerequisites

Install the AWS Serverless Application Model Command Line Interface (AWS SAM CLI), version **>=1.135.0**, by following the [instructions](<https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html>).

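The version requirement can be checked before deploying. The following is a minimal sketch, not part of this commit: the helper names `parse_version`, `sam_version_ok`, and `installed_sam_version_ok` are hypothetical, and it assumes only that `sam --version` prints a line containing an `X.Y.Z` version string.

```python
import re
import subprocess

MIN_SAM_VERSION = (1, 135, 0)  # minimum version stated in the prerequisites


def parse_version(text):
    """Return the first X.Y.Z version found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def sam_version_ok(version_output, minimum=MIN_SAM_VERSION):
    """Return True if the version found in version_output is at least minimum."""
    version = parse_version(version_output)
    return version is not None and version >= minimum


def installed_sam_version_ok():
    """Run `sam --version` (assumes `sam` is on PATH) and check its output."""
    result = subprocess.run(["sam", "--version"],
                            capture_output=True, text=True, check=True)
    return sam_version_ok(result.stdout)
```

Calling `installed_sam_version_ok()` returns `False` when the installed CLI is older than 1.135.0 or when no version string can be found. Comparing tuples of integers avoids the pitfalls of comparing version strings lexically.
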
## Architecture

TBD

## Deploy

Begin by installing the Python packages required by the Lambda function.
In your shell:

```bash
cd dashboards
pip install -r requirements.txt -t .
cd ..
```

You are now ready to deploy the serverless application. Run the following in your shell, replacing `<CLUSTER_NAME>`, `<SUBNET_ID>`, and `<VPC_ID>` with your own values:

```bash
OBS_DASHBOARD_NAME="ml-obs-dashboard"
sam build
sam deploy --stack-name ${OBS_DASHBOARD_NAME} \
    --guided \
    --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
    --parameter-overrides \
    ParameterKey=PCClusterName,ParameterValue=<CLUSTER_NAME> \
    ParameterKey=SubnetId,ParameterValue=<SUBNET_ID> \
    ParameterKey=VpcId,ParameterValue=<VPC_ID>
```

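If you script the deployment, the `sam deploy` arguments shown above can be assembled programmatically. This is a rough sketch under stated assumptions: `sam_deploy_command` is a hypothetical helper, and the cluster name, subnet ID, and VPC ID are placeholder values, not real resources.

```python
import shlex


def sam_deploy_command(stack_name, parameters):
    """Build the argv list for the `sam deploy` invocation used in this README."""
    argv = [
        "sam", "deploy",
        "--stack-name", stack_name,
        "--guided",
        "--capabilities", "CAPABILITY_IAM", "CAPABILITY_AUTO_EXPAND",
        "--parameter-overrides",
    ]
    # One ParameterKey=...,ParameterValue=... token per template parameter.
    argv += [f"ParameterKey={key},ParameterValue={value}"
             for key, value in parameters.items()]
    return argv


# Placeholder values for illustration only; substitute your own identifiers.
command = sam_deploy_command(
    "ml-obs-dashboard",
    {
        "PCClusterName": "my-cluster",
        "SubnetId": "subnet-0123456789abcdef0",
        "VpcId": "vpc-0123456789abcdef0",
    },
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` after `sam build` has completed. Building a list rather than a single string avoids shell-quoting problems in parameter values.
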
## Clean up

To delete the SAM application deployment, run the following in your terminal:

```bash
sam delete
```
@@ -0,0 +1,71 @@
AWSTemplateFormatVersion: '2010-09-09'
Description: "Setup to monitor SageMaker HyperPod clusters on AWS. Amazon Managed Prometheus and Amazon Managed Grafana workspaces with associated IAM roles are deployed in the AWS Account. Prometheus and exporter services are set up on Cluster Nodes. Author: Matt Nightingale - nghtm@"


Resources:
  AmazonGrafanaWorkspaceIAMRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - grafana.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      RoleName: !Sub ${AWS::StackName}-Grafana-Role

  AmazonGrafanaPrometheusPolicy:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: AmazonGrafana_Prometheus_policy
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - aps:ListWorkspaces
              - aps:DescribeWorkspace
              - aps:QueryMetrics
              - aps:GetLabels
              - aps:GetSeries
              - aps:GetMetricMetadata
            Resource: "*"
      Roles: [!Ref AmazonGrafanaWorkspaceIAMRole]

  AmazonGrafanaWorkspace:
    Type: 'AWS::Grafana::Workspace'
    Properties:
      AccountAccessType: CURRENT_ACCOUNT
      Name: !Sub ${AWS::StackName}-Dashboard
      Description: Amazon Grafana Workspace to monitor SageMaker Cluster
      AuthenticationProviders:
        - AWS_SSO
      PermissionType: SERVICE_MANAGED
      RoleArn: !GetAtt
        - AmazonGrafanaWorkspaceIAMRole
        - Arn
      DataSources: ["CLOUDWATCH","PROMETHEUS"]
      OrganizationRoleName: "ADMIN"

  APSWorkspace:
    Type: AWS::APS::Workspace
    Properties:
      Alias: !Sub ${AWS::StackName}-Hyperpod-WorkSpace
      Tags:
        - Key: Name
          Value: SageMaker Hyperpod PrometheusMetrics

Outputs:
  Region:
    Value: !Ref "AWS::Region"
  AMPRemoteWriteURL:
    Value: !Join ["" , [ !GetAtt APSWorkspace.PrometheusEndpoint , "api/v1/remote_write" ]]
  AMPEndPointUrl:
    Value: !GetAtt APSWorkspace.PrometheusEndpoint
  GrafanWorkspaceURL:
    Value: !Join ["" , [ "https://", !GetAtt AmazonGrafanaWorkspace.Endpoint ]]
  GrafanWorkspaceId:
    Value: !GetAtt AmazonGrafanaWorkspace.Id
@@ -0,0 +1,121 @@
#!/usr/bin/env python3

import boto3
from grafana_client import GrafanaApi, HeaderAuth, TokenAuth
from grafanalib._gen import DashboardEncoder
from grafanalib.core import Dashboard
import json
from typing import Dict
import urllib.request

from grafana_client.knowledge import datasource_factory
from grafana_client.model import DatasourceModel

import os
import cfnresponse

PROM_DASHBOARDS_URL = [
    'https://grafana.com/api/dashboards/12239/revisions/latest/download',
    'https://grafana.com/api/dashboards/1860/revisions/latest/download',
    'https://grafana.com/api/dashboards/20579/revisions/latest/download'
]


def create_prometheus_datasource(grafana, url, aws_region):
    """Create a SigV4-authenticated Prometheus data source and return it."""
    jsonData = {
        'sigV4Auth': True,
        'sigV4AuthType': 'ec2_iam_role',
        'sigV4Region': aws_region,
        'httpMethod': 'GET'
    }

    datasource = DatasourceModel(name="Prometheus",
                                 type="prometheus",
                                 url=url,
                                 access="proxy",
                                 jsonData=jsonData)
    datasource = datasource_factory(datasource)
    datasource = datasource.asdict()
    datasource = grafana.datasource.create_datasource(datasource)["datasource"]
    # Query the health endpoint of the new data source; the result is unused.
    grafana.datasource.health(datasource['uid'])
    return datasource


def encode_dashboard(entity) -> Dict:
    """
    Encode grafanalib `Dashboard` entity to dictionary.

    TODO: Optimize without going through JSON marshalling.
    """
    return json.loads(json.dumps(entity, sort_keys=True, cls=DashboardEncoder))


def mk_dash(datasource_uid, url):
    """Download a dashboard definition and point it at the given data source."""
    # Close the HTTP response once the JSON body has been parsed.
    with urllib.request.urlopen(url) as response:
        dashboard = json.load(response)

    # Rewrite the data source on every top-level panel and template variable.
    for i in dashboard['panels']:
        i["datasource"] = {"type": "prometheus", "uid": datasource_uid}

    for i in dashboard['templating']['list']:
        i["datasource"] = {"type": "prometheus", "uid": datasource_uid}

    return {"dashboard": dashboard, "overwrite": True}


def lambda_handler(event, context):

    # Dashboards are only provisioned on stack creation. Acknowledge Update
    # and Delete events so CloudFormation does not wait for a timeout.
    if event['RequestType'] != 'Create':
        responseData = {}
        responseData['Data'] = 0
        cfnresponse.send(event, context, cfnresponse.SUCCESS, responseData,
                         "CustomResourcePhysicalID")
        return {'statusCode': 200, 'body': json.dumps('Update or Delete')}

    aws_region = os.environ['REGION']
    grafana_key_name = "CreateDashboards"
    grafana_url = os.environ['GRAFANA_WORKSPACE_URL']
    workspace_id = os.environ['GRAFANA_WORKSPACE_ID']
    prometheus_url = os.environ['PROMETHEUS_URL']

    client = boto3.client('grafana')
    try:
        # Short-lived admin key, used only for the Grafana API calls below.
        api_key = client.create_workspace_api_key(keyName=grafana_key_name,
                                                  keyRole='ADMIN',
                                                  secondsToLive=60,
                                                  workspaceId=workspace_id)
    except Exception as e:
        print(e)
        responseData = {}
        responseData['Data'] = 123
        cfnresponse.send(event, context, cfnresponse.FAILED, responseData,
                         "CustomResourcePhysicalID")
        # Without a key nothing below can succeed, so stop here instead of
        # falling through and sending a second response.
        return {'statusCode': 500, 'body': json.dumps('API key creation failed')}

    try:
        grafana = GrafanaApi.from_url(
            url=grafana_url,
            credential=TokenAuth(token=api_key['key']),
        )

        prometheus_datasource = create_prometheus_datasource(
            grafana, prometheus_url, aws_region)

        for i in PROM_DASHBOARDS_URL:
            dashboard_payload = mk_dash(prometheus_datasource['uid'], i)
            grafana.dashboard.update_dashboard(dashboard_payload)

        responseData = {}
        responseData['Data'] = 123
        cfnresponse.send(event, context, cfnresponse.SUCCESS, responseData,
                         "CustomResourcePhysicalID")

    except Exception as e:
        print(e)
        responseData = {}
        responseData['Data'] = 123
        cfnresponse.send(event, context, cfnresponse.FAILED, responseData,
                         "CustomResourcePhysicalID")

    # Remove the temporary key whether or not dashboard creation succeeded.
    client.delete_workspace_api_key(keyName=grafana_key_name,
                                    workspaceId=workspace_id)

    return {'statusCode': 200, 'body': json.dumps('Dashboards created')}
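
Note on `mk_dash`: it rewrites the data source only on the top-level entries of `dashboard['panels']`. Grafana stores the panels of a collapsed row under that row's own `panels` key, so such nested panels keep the data source from the downloaded JSON. A minimal sketch of a recursive variant follows; `set_prometheus_datasource` is a hypothetical helper, not part of this commit, and the dashboard fragment is hand-written for illustration.

```python
def set_prometheus_datasource(panels, datasource_uid):
    """Point every panel, including panels nested inside rows, at one data source."""
    for panel in panels:
        panel["datasource"] = {"type": "prometheus", "uid": datasource_uid}
        # Collapsed rows keep their child panels under their own "panels" key.
        set_prometheus_datasource(panel.get("panels", []), datasource_uid)


# Tiny hand-written dashboard fragment, for illustration only.
dashboard = {
    "panels": [
        {"type": "graph", "title": "GPU utilisation"},
        {"type": "row", "collapsed": True,
         "panels": [{"type": "stat", "title": "Node count"}]},
    ]
}
set_prometheus_datasource(dashboard["panels"], "example-uid")
```

Inside `mk_dash`, the first `for` loop could be replaced by a call such as `set_prometheus_datasource(dashboard['panels'], datasource_uid)`. Whether the three downloaded dashboards actually contain collapsed rows has not been checked here.
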
@@ -0,0 +1,3 @@
certifi
grafana-client==4.3.2
grafanalib==0.7.1
@@ -0,0 +1,79 @@
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  Prometheus Agent Collector


Parameters:
  PCClusterName:
    Type: String
  SubnetId:
    Type: AWS::EC2::Subnet::Id
  VpcId:
    Type: AWS::EC2::VPC::Id

Resources:
  GrafanaPrometheus:
    Type: AWS::Serverless::Application
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      Location: cluster-observability.yaml

  GrafanaLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: "sts:AssumeRole"
      Policies:
        - PolicyName: GrafanaLambda
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - 'grafana:CreateWorkspaceApiKey'
                  - 'grafana:DeleteWorkspaceApiKey'
                Resource: "*"

  DashboardCreationLambda:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: dashboards
      Environment:
        Variables:
          REGION: !Ref AWS::Region
          PROMETHEUS_URL: !GetAtt GrafanaPrometheus.Outputs.AMPEndPointUrl
          GRAFANA_WORKSPACE_ID: !GetAtt GrafanaPrometheus.Outputs.GrafanWorkspaceId
          GRAFANA_WORKSPACE_URL: !GetAtt GrafanaPrometheus.Outputs.GrafanWorkspaceURL
      Handler: create_ml_dashboards.lambda_handler
      Runtime: python3.13
      Role: !GetAtt GrafanaLambdaRole.Arn
      Timeout: 10
      MemorySize: 128
      Tags:
        Application: MLDashboards

  PrometheusCollector:
    Type: AWS::Serverless::Application
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      Location: prometheus-agent-collector.yaml
      Parameters:
        PCClusterNAME: !Ref PCClusterName
        ManagedPrometheusUrl: !GetAtt GrafanaPrometheus.Outputs.AMPRemoteWriteURL
        SubnetId: !Ref SubnetId
        VpcId: !Ref VpcId

  LambdaTrigger:
    Type: "Custom::LambdaTrigger"
    Properties:
      ServiceToken:
        !GetAtt DashboardCreationLambda.Arn
      ServiceTimeout: 300
