Skip to content

Latest commit

 

History

History
49 lines (36 loc) · 1.47 KB

File metadata and controls

49 lines (36 loc) · 1.47 KB

Machine Learning Infrastructure Monitoring for Slurm based cluster

This solution provides a "1 click" deployment observability stack to monitor your slurm based machine learning infrastructure. It automatically creates:

  • Amazon Managed Grafana
  • Amazon Managed Prometheus
  • Setup a Prometheus Agent Collector
  • Create data source and dashboard in Grafana

Prerequisites

Install AWS Serverless Application Model Command Line Interface (AWS SAM CLI) version >=1.135.0 by following the instructions

Architecture

TBD

Deploy

You will begin by installing the necessary Python package needed for the lambda function. In your shell:

cd dashboards
pip install -r requirements.txt -t .
cd ..

You are now ready to deploy the serverless application, run the following in your shell:

OBS_DASHBOARD_NAME="ml-obs-dashboard"
sam build -t managed-cluster-observability-pc.yaml
sam deploy -t managed-cluster-observability-pc.yaml\
    --stack-name ${OBS_DASHBOARD_NAME} \
    --guided \
    --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
    --parameter-overrides \
    ParameterKey=PCClusterName,ParameterValue=<CLUSTER_NAME> \
    ParameterKey=SubnetId,ParameterValue=<SUBNET_ID> \
    ParameterKey=VpcId,ParameterValue=<VPC_ID>

Clean up

To delete the SAM application deployment, you can use the terminal and enter:

sam delete