Commit 4958264

Add 1click grafana dashboard deployment
1 parent 4039347 commit 4958264

7 files changed: +750 -0 lines changed

@@ -0,0 +1,49 @@
# Machine Learning Infrastructure Monitoring for Slurm-based Clusters <!-- omit from toc -->

This solution provides a "1 click" deployment of an observability stack to monitor your Slurm-based machine learning infrastructure. It automatically:

- Creates an Amazon Managed Grafana workspace
- Creates an Amazon Managed Service for Prometheus workspace
- Sets up a Prometheus agent collector
- Creates the data source and dashboards in Grafana

## Prerequisites

Install the AWS Serverless Application Model Command Line Interface (AWS SAM CLI), version **>=1.135.0**, by following the [instructions](<https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html>).
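
The version requirement above can be checked before deploying. The following is a minimal sketch, assuming that `sam --version` prints a line such as `SAM CLI, version 1.135.0`; the function names are hypothetical and not part of this repository. Versions are compared as integer tuples, because a plain string comparison would rank `1.99.1` above `1.135.0`.

```python
import re
from typing import Tuple

# Minimum AWS SAM CLI version stated in the prerequisites.
MINIMUM_SAM_VERSION = (1, 135, 0)


def parse_sam_version(version_output: str) -> Tuple[int, ...]:
    """Extract the first MAJOR.MINOR.PATCH triple from the CLI output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return tuple(int(part) for part in match.groups())


def sam_version_is_supported(version_output: str,
                             minimum: Tuple[int, ...] = MINIMUM_SAM_VERSION) -> bool:
    """Return True when the reported version is at least the minimum."""
    # Tuples compare element by element, so (1, 99, 1) < (1, 135, 0).
    return parse_sam_version(version_output) >= minimum
```

To use it, pass the text printed by `sam --version`, captured for example with `subprocess.run(["sam", "--version"], capture_output=True, text=True).stdout`.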

## Architecture

TBD

## Deploy

Begin by installing the Python packages required by the Lambda function. In your shell:

```bash
cd dashboards
pip install -r requirements.txt -t .
cd ..
```

You are now ready to deploy the serverless application. Run the following in your shell, replacing `<CLUSTER_NAME>`, `<SUBNET_ID>`, and `<VPC_ID>` with your own values:

```bash
OBS_DASHBOARD_NAME="ml-obs-dashboard"
sam build -t managed-cluster-observability-pc.yaml
sam deploy -t managed-cluster-observability-pc.yaml \
    --stack-name ${OBS_DASHBOARD_NAME} \
    --guided \
    --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
    --parameter-overrides \
    ParameterKey=PCClusterName,ParameterValue=<CLUSTER_NAME> \
    ParameterKey=SubnetId,ParameterValue=<SUBNET_ID> \
    ParameterKey=VpcId,ParameterValue=<VPC_ID>
```

## Clean up

To delete the SAM application deployment, run the following in your terminal:

```bash
sam delete
```
@@ -0,0 +1,73 @@
AWSTemplateFormatVersion: '2010-09-09'
Description: "Setup to monitor SageMaker HyperPod clusters on AWS. Amazon Managed Prometheus and Amazon Managed Grafana workspaces with associated IAM roles are deployed in the AWS Account. Prometheus and exporter services are set up on Cluster Nodes. Author: Matt Nightingale - nghtm@"

Resources:
  AmazonGrafanaWorkspaceIAMRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - grafana.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonGrafanaCloudWatchAccess
      RoleName: !Sub ${AWS::StackName}-Grafana-Role

  AmazonGrafanaPrometheusPolicy:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: AmazonGrafana_Prometheus_policy
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - aps:ListWorkspaces
              - aps:DescribeWorkspace
              - aps:QueryMetrics
              - aps:GetLabels
              - aps:GetSeries
              - aps:GetMetricMetadata
            Resource: "*"
      Roles: [!Ref AmazonGrafanaWorkspaceIAMRole]

  AmazonGrafanaWorkspace:
    Type: 'AWS::Grafana::Workspace'
    Properties:
      AccountAccessType: CURRENT_ACCOUNT
      Name: !Sub ${AWS::StackName}-Dashboard
      Description: Amazon Grafana Workspace to monitor SageMaker Cluster
      AuthenticationProviders:
        - AWS_SSO
      PermissionType: SERVICE_MANAGED
      RoleArn: !GetAtt
        - AmazonGrafanaWorkspaceIAMRole
        - Arn
      DataSources: ["CLOUDWATCH","PROMETHEUS"]
      OrganizationRoleName: "ADMIN"

  APSWorkspace:
    Type: AWS::APS::Workspace
    Properties:
      Alias: !Sub ${AWS::StackName}-Hyperpod-WorkSpace
      Tags:
        - Key: Name
          Value: SageMaker Hyperpod PrometheusMetrics

Outputs:
  Region:
    Value: !Ref "AWS::Region"
  AMPRemoteWriteURL:
    Value: !Join ["" , [ !GetAtt APSWorkspace.PrometheusEndpoint , "api/v1/remote_write" ]]
  AMPEndPointUrl:
    Value: !GetAtt APSWorkspace.PrometheusEndpoint
  GrafanWorkspaceURL:
    Value: !Join ["" , [ "https://", !GetAtt AmazonGrafanaWorkspace.Endpoint ]]
  GrafanWorkspaceId:
    Value: !GetAtt AmazonGrafanaWorkspace.Id
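
Review note: the `AMPRemoteWriteURL` output joins the workspace endpoint and `api/v1/remote_write` with an empty separator, so the result is only a valid URL when the endpoint already ends in `/`. The sketch below shows the same concatenation in Python with the trailing slash normalized. The function name and the example endpoint are hypothetical, used for illustration only.

```python
def remote_write_url(prometheus_endpoint: str) -> str:
    """Build the remote-write URL for a Prometheus workspace endpoint.

    Works whether or not the endpoint ends with a trailing slash.
    """
    return prometheus_endpoint.rstrip("/") + "/api/v1/remote_write"


# Hypothetical endpoint, shown with and without the trailing slash.
with_slash = "https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-example/"
without_slash = with_slash.rstrip("/")

print(remote_write_url(with_slash))
print(remote_write_url(without_slash))
```

Both calls produce the same URL, which is the property the `!Join` expression depends on.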
@@ -0,0 +1,121 @@
#!/usr/bin/env python3

import json
import os
import urllib.request
from typing import Dict

import boto3
import cfnresponse
from grafana_client import GrafanaApi, TokenAuth
from grafana_client.knowledge import datasource_factory
from grafana_client.model import DatasourceModel
from grafanalib._gen import DashboardEncoder

# Community dashboards from grafana.com that are imported into the workspace.
PROM_DASHBOARDS_URL = [
    'https://grafana.com/api/dashboards/12239/revisions/latest/download',
    'https://grafana.com/api/dashboards/1860/revisions/latest/download',
    'https://grafana.com/api/dashboards/20579/revisions/latest/download'
]


def create_prometheus_datasource(grafana, url, aws_region):
    """Create a SigV4-authenticated Prometheus data source and return it."""
    jsonData = {
        'sigV4Auth': True,
        'sigV4AuthType': 'ec2_iam_role',
        'sigV4Region': aws_region,
        'httpMethod': 'GET'
    }

    datasource = DatasourceModel(name="Prometheus",
                                 type="prometheus",
                                 url=url,
                                 access="proxy",
                                 jsonData=jsonData)
    datasource = datasource_factory(datasource)
    datasource = datasource.asdict()
    datasource = grafana.datasource.create_datasource(datasource)["datasource"]
    # Run the health check for the newly created data source.
    grafana.datasource.health(datasource['uid'])
    return datasource


def encode_dashboard(entity) -> Dict:
    """
    Encode grafanalib `Dashboard` entity to dictionary.

    TODO: Optimize without going through JSON marshalling.
    """
    return json.loads(json.dumps(entity, sort_keys=True, cls=DashboardEncoder))


def mk_dash(datasource_uid, url):
    """Download a dashboard and point its panels and variables at the data source."""
    with urllib.request.urlopen(url) as response:
        dashboard = json.load(response)

    for panel in dashboard['panels']:
        panel["datasource"] = {"type": "prometheus", "uid": datasource_uid}

    for variable in dashboard['templating']['list']:
        variable["datasource"] = {"type": "prometheus", "uid": datasource_uid}

    return {"dashboard": dashboard, "overwrite": True}


def lambda_handler(event, context):
    # Dashboards are only provisioned on stack creation. Update and Delete
    # events are acknowledged without doing any work.
    if event['RequestType'] != 'Create':
        responseData = {}
        responseData['Data'] = 0
        cfnresponse.send(event, context, cfnresponse.SUCCESS, responseData,
                         "CustomResourcePhysicalID")
        return {'statusCode': 200, 'body': json.dumps('Update or Delete')}

    aws_region = os.environ['REGION']
    grafana_key_name = "CreateDashboards"
    grafana_url = os.environ['GRAFANA_WORKSPACE_URL']
    workspace_id = os.environ['GRAFANA_WORKSPACE_ID']
    prometheus_url = os.environ['PROMETHEUS_URL']

    client = boto3.client('grafana')
    try:
        # Short-lived admin key, used only for the duration of this invocation.
        api_key = client.create_workspace_api_key(keyName=grafana_key_name,
                                                  keyRole='ADMIN',
                                                  secondsToLive=60,
                                                  workspaceId=workspace_id)
    except Exception as e:
        print(e)
        responseData = {}
        responseData['Data'] = 123
        cfnresponse.send(event, context, cfnresponse.FAILED, responseData,
                         "CustomResourcePhysicalID")
        # Nothing else can run without an API key, so stop here instead of
        # falling through and sending a second response.
        return {'statusCode': 500, 'body': json.dumps('API key creation failed')}

    try:
        grafana = GrafanaApi.from_url(
            url=grafana_url,
            credential=TokenAuth(token=api_key['key']),
        )

        prometheus_datasource = create_prometheus_datasource(
            grafana, prometheus_url, aws_region)

        for dashboard_url in PROM_DASHBOARDS_URL:
            dashboard_payload = mk_dash(prometheus_datasource['uid'], dashboard_url)
            grafana.dashboard.update_dashboard(dashboard_payload)

        responseData = {}
        responseData['Data'] = 123
        cfnresponse.send(event, context, cfnresponse.SUCCESS, responseData,
                         "CustomResourcePhysicalID")

    except Exception as e:
        print(e)
        responseData = {}
        responseData['Data'] = 123
        cfnresponse.send(event, context, cfnresponse.FAILED, responseData,
                         "CustomResourcePhysicalID")

    # Remove the temporary API key whether or not the dashboards were created.
    client.delete_workspace_api_key(keyName=grafana_key_name,
                                    workspaceId=workspace_id)

    return {'statusCode': 200, 'body': json.dumps('Dashboards created')}
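
Review note: `mk_dash` rewrites the `datasource` field of top-level panels and template variables only, and it raises `KeyError` if a downloaded dashboard lacks `panels` or `templating`. The sketch below is one possible, more defensive variant that is not part of this commit. It assumes that collapsed row panels nest their children under a `panels` key and that individual query targets may carry their own `datasource` field. The function name is hypothetical.

```python
from typing import Any, Dict, List


def set_prometheus_datasource(dashboard: Dict[str, Any],
                              datasource_uid: str) -> Dict[str, Any]:
    """Point every panel, query target, and variable at one data source.

    The dashboard dictionary is modified in place and also returned.
    """

    def datasource_ref() -> Dict[str, str]:
        # Return a fresh dict each time so entries do not share one object.
        return {"type": "prometheus", "uid": datasource_uid}

    def visit(panels: List[Dict[str, Any]]) -> None:
        for panel in panels:
            panel["datasource"] = datasource_ref()
            for target in panel.get("targets", []):
                target["datasource"] = datasource_ref()
            # Assumption: row panels keep their children under "panels".
            visit(panel.get("panels", []))

    visit(dashboard.get("panels", []))

    for variable in dashboard.get("templating", {}).get("list", []):
        variable["datasource"] = datasource_ref()

    return dashboard
```

Because missing keys default to empty lists, a dashboard without panels or template variables passes through unchanged instead of raising.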
@@ -0,0 +1,3 @@
certifi
grafana-client==4.3.2
grafanalib==0.7.1
@@ -0,0 +1,79 @@
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  Grafana Dashboards deployment

Parameters:
  PCClusterName:
    Type: String
  SubnetId:
    Type: AWS::EC2::Subnet::Id
  VpcId:
    Type: AWS::EC2::VPC::Id

Resources:
  GrafanaPrometheus:
    Type: AWS::Serverless::Application
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      Location: cluster-observability.yaml

  GrafanaLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: "sts:AssumeRole"
      Policies:
        - PolicyName: GrafanaLambda
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - 'grafana:CreateWorkspaceApiKey'
                  - 'grafana:DeleteWorkspaceApiKey'
                Resource: "*"

  DashboardCreationLambda:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: dashboards
      Environment:
        Variables:
          REGION: !Ref AWS::Region
          PROMETHEUS_URL: !GetAtt GrafanaPrometheus.Outputs.AMPEndPointUrl
          GRAFANA_WORKSPACE_ID: !GetAtt GrafanaPrometheus.Outputs.GrafanWorkspaceId
          GRAFANA_WORKSPACE_URL: !GetAtt GrafanaPrometheus.Outputs.GrafanWorkspaceURL
      Handler: create_ml_dashboards.lambda_handler
      Runtime: python3.13
      Role: !GetAtt GrafanaLambdaRole.Arn
      Timeout: 10
      MemorySize: 128
      Tags:
        Application: MLDashboards

  PrometheusCollector:
    Type: AWS::Serverless::Application
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      Location: prometheus-agent-collector.yaml
      Parameters:
        PCClusterNAME: !Ref PCClusterName
        ManagedPrometheusUrl: !GetAtt GrafanaPrometheus.Outputs.AMPRemoteWriteURL
        SubnetId: !Ref SubnetId
        VpcId: !Ref VpcId

  LambdaTrigger:
    Type: "Custom::LambdaTrigger"
    Properties:
      ServiceToken: !GetAtt DashboardCreationLambda.Arn
      ServiceTimeout: 300
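
Review note: the `LambdaTrigger` custom resource invokes `DashboardCreationLambda` and then waits until a response document is uploaded to the pre-signed URL in `event["ResponseURL"]`, or until `ServiceTimeout` expires. The handler therefore has to send exactly one response on every code path. The sketch below shows the fields of that response document as a plain dictionary. It illustrates the payload only and is not the `cfnresponse` implementation. The function name and the sample event values are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def build_cfn_response(event: Dict[str, Any],
                       status: str,
                       physical_resource_id: str,
                       data: Optional[Dict[str, Any]] = None,
                       reason: Optional[str] = None) -> Dict[str, Any]:
    """Assemble the response document for a CloudFormation custom resource."""
    if status not in ("SUCCESS", "FAILED"):
        raise ValueError("status must be 'SUCCESS' or 'FAILED'")
    return {
        "Status": status,
        "Reason": reason or "See the Lambda function logs for details",
        "PhysicalResourceId": physical_resource_id,
        # These three values are echoed back from the incoming event.
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


# Hypothetical event containing only the fields the response echoes back.
sample_event = {
    "RequestType": "Create",
    "StackId": "arn:aws:cloudformation:us-west-2:111122223333:stack/example/id",
    "RequestId": "example-request-id",
    "LogicalResourceId": "LambdaTrigger",
}

body = build_cfn_response(sample_event, "SUCCESS", "CustomResourcePhysicalID",
                          data={"Data": 123})
print(json.dumps(body, indent=2))
```

In the deployed function, a helper such as `cfnresponse.send` serializes a document of this shape and sends it with an HTTP `PUT` to the response URL.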
