@@ -70,14 +70,14 @@ Slurm [@slurm] is an open-source cluster management and job
 scheduling system for Linux-based compute clusters and is widely used
 for High Performance Computing (HPC). It offers command line tools to
 export and analyze cluster use and various applications have been
-developed to monitor the current state of the cluster (e.g. live
-dashboards using Grafana [@grafanadb]). A feature rich tool for the
-analysis of cluster performance is Open XDMoD [@xdmod] that supports
-various schedulers and metrics. Open XDMoD uses 3^rd^ party software
+developed to monitor the current state of the cluster (e.g., live
+dashboards using Grafana [@grafanadb]). A feature-rich tool for the
+analysis of cluster performance is Open XDMoD [@xdmod], which supports
+various schedulers and metrics. Open XDMoD uses 3rd-party software
 libraries that are not free for commercial use. Open OnDemand
 [@Hudak2018] allows users to access a HPC cluster using a web portal,
 it provides various apps to facilitate HPC usage and can integrate the
-Open XDMoD for usage statistics. Both, Open XDMoD and Open OnDemand
+Open XDMoD for usage statistics. Both Open XDMoD and Open OnDemand
 require continuous support and extensive configurations and therefore,
 intuitive, responsive, easy-to-install and easy-to-use applications that
 enable HPC administrators and managers to analyze and visualize cluster
@@ -99,15 +99,15 @@ using docker containers and a docker-compose implementation.
 ## SCAS Dashboard overview
 
 The SCAS dashboard architecture consists of a nginx web server as a
-router (reverse proxy), a frontend based on R-Shiny [@shiny; @R;
-@shinydashboard], a backend based on Python using the Django REST
-framework to provide an API, and a PostgreSQL database as backend (see
-\autoref{fig:fig1}). The dashboard is intended for the HPC stakeholders
-and therefore includes secure user authentication. The frontend is a
+router (reverse proxy), a front end based on R-Shiny [@shiny; @R;
+@shinydashboard], a back end based on Python using the Django REST
+framework to provide an API, and a PostgreSQL database as back end (see
+\autoref{fig:fig1}). The dashboard is intended for HPC stakeholders
+and therefore includes secure user authentication. The front end is a
 user-friendly interface for filtering and visualizing the Slurm data.
-The backend provides an admin interface via Python Django Admin and a
-web API that is used by both the frontend and a script for uploading new
-data. Additionally, the backend creates a daily index of the data,
+The back end provides an admin interface via Python Django Admin and a
+web API that is used by both the front end and a script for uploading new
+data. Additionally, the back end creates a daily index of the data,
 enabling the software to maintain a low memory footprint while being
 fast and responsive. Furthermore, a presentation can be generated
 automatically and viewed by various stakeholders, including the cluster
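The four-service architecture this hunk describes (nginx router, R-Shiny front end, Django REST back end, PostgreSQL) and the docker-compose deployment named in the hunk header could be sketched roughly as follows; the service names, images, ports, and credentials are illustrative assumptions, not the repository's actual compose file:

```yaml
# Illustrative sketch only -- the real compose file is in the
# scas_dashboard repository; names, images, and ports are assumptions.
services:
  router:                       # nginx reverse proxy in front of both apps
    image: nginx:stable
    ports: ["443:443"]
    depends_on: [frontend, backend]
  frontend:                     # R-Shiny dashboard UI
    image: rocker/shiny
  backend:                      # Django REST framework API
    build: ./backend            # hypothetical build context
    environment:
      DATABASE_URL: postgres://scas:secret@db:5432/scas
    depends_on: [db]
  db:                           # PostgreSQL holds jobs and the daily index
    image: postgres:15
    environment:
      POSTGRES_DB: scas
      POSTGRES_USER: scas
      POSTGRES_PASSWORD: secret
```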
@@ -128,12 +128,12 @@ width=100%}
 Completed compute jobs and available node configurations are submitted
 to the SCAS-backend API with a script that utilizes the Slurm's *sacct*
 tool. This script can be run as a daily or weekly *cron* job on a job
-submission node. The backend then generates the daily statistics that
+submission node. The back end then generates the daily statistics that
 are stored in the database. This preprocessed indexed data enables the
 app to have a low memory footprint and high responsiveness, as no
 calculations are required when the data is fetched from the API. Upon
-filtering a date range in the frontend, a request is sent to the backend
-which retrieves the data for the selected days and aggregates the
+filtering a date range in the front end, a request is sent to the back end
+that retrieves the data for the selected days and aggregates the
 statistics to generate the visualizations.
 
 ## Frontend -- dashboard user interface
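The upload path this hunk describes (a cron-driven script that exports completed jobs with `sacct` and submits them to the back-end API) could look roughly like the sketch below. The field list, API endpoint, and token scheme are assumptions for illustration, not the repository's actual script; only the `sacct` flags are standard Slurm options.

```python
import csv
import io
import json
import subprocess
import urllib.request

# Columns exported from Slurm accounting -- an illustrative subset.
FIELDS = ["JobID", "Partition", "AllocCPUS", "Elapsed", "State"]

def parse_sacct(raw: str) -> list[dict]:
    """Parse pipe-delimited `sacct --parsable2` output (header included)."""
    return list(csv.DictReader(io.StringIO(raw), delimiter="|"))

def fetch_jobs(start: str, end: str) -> list[dict]:
    """Export all jobs in [start, end) from Slurm accounting."""
    out = subprocess.run(
        ["sacct", "--allusers", "--parsable2",
         "--starttime", start, "--endtime", end,
         "--format", ",".join(FIELDS)],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_sacct(out)

def upload(jobs: list[dict], api_url: str, token: str) -> None:
    """POST the job records to the back-end API (endpoint is hypothetical)."""
    req = urllib.request.Request(
        api_url,
        data=json.dumps(jobs).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Token {token}"},
    )
    urllib.request.urlopen(req)
```

Run daily from cron on a submission node, such a script keeps the back end's pre-aggregated index current without the dashboard ever touching Slurm directly.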
@@ -179,15 +179,15 @@ pending times) for GPU servers, with 16 GPUs, over a time frame of 1
 year. As shown in \hyperref[fig:fig2]{Figure 2b,c}, the increase in
 the number of GPU jobs and CPU hours for the GPU partition is visible
 and confirms the assumption. By inspecting the pending times per day
-(\hyperref[fig:fig2]{Figure 2d}) there is a general, unbiased
+(\hyperref[fig:fig2]{Figure 2d}), there is a general, unbiased
 increase of the pending times visible for the last few months. From
 \hyperref[fig:fig2]{Figure 2e} we can then see an increase of the
 pending times for the GPU partition for the previous 6 months.
 \hyperref[fig:fig2]{Figure 2f} shows that the increase of the pending
 times is only seen for servers with >10 GPUs, and the utilization of
 the nodes with 16 GPUs has increased while those with 2 and 4 GPUs were
 stable (\hyperref[fig:fig2]{Figure 2g}). This analysis can be used to
-draw concrete conclusions. In this case to either inform the users that
+draw concrete conclusions, in this case, to either inform the users that
 resources are available if up to 4 GPUs are requested, or to make the
 decision to invest in new GPU servers to achieve shorter pending times
 and higher throughput.
@@ -209,7 +209,7 @@ plot showing the utilization of nodes with different numbers of GPUs.
 The SCAS dashboard enables rapid and responsive analysis of Slurm-based
 cluster usage. This allows stakeholders: I) to identify current
 bottlenecks of CPU and GPU utilization, II) to make informed decisions
-to adapt SLURM parameters in the short term and III) to support
+to adapt SLURM parameters in the short term, and III) to support
 strategic decisions, all based on user needs. The SCAS dashboard, code,
 and the documentation are hosted on a publicly available GitHub
 repository (<https://github.com/Bioinformatics-Munich/scas_dashboard>).