diff --git a/joss.06017/10.21105.joss.06017.crossref.xml b/joss.06017/10.21105.joss.06017.crossref.xml new file mode 100644 index 0000000000..20521ed5d0 --- /dev/null +++ b/joss.06017/10.21105.joss.06017.crossref.xml @@ -0,0 +1,209 @@ + + + + 20240510T112720-9d0f5cd6814ee5bb5c70c3b84af6ea83a14fce77 + 20240510112720 + + JOSS Admin + admin@theoj.org + + The Open Journal + + + + + Journal of Open Source Software + JOSS + 2475-9066 + + 10.21105/joss + https://joss.theoj.org + + + + + 05 + 2024 + + + 9 + + 97 + + + + SCAS dashboard: A tool to intuitively and interactively +analyze Slurm cluster usage + + + + Thomas + Walzthoeni + https://orcid.org/0009-0009-3995-709X + + + Bom Bahadur + Singiali + + + N. William + Rayner + https://orcid.org/0000-0003-0510-4792 + + + Francesco Paolo + Casale + + + Christoph + Feest + https://orcid.org/0000-0002-0772-7267 + + + Carsten + Marr + https://orcid.org/0000-0003-2154-4552 + + + Alf + Wachsmann + https://orcid.org/0000-0002-7736-3059 + + + + 05 + 10 + 2024 + + + 6017 + + + 10.21105/joss.06017 + + + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + + + + Software archive + 10.5281/zenodo.10064783 + + + GitHub review issue + https://github.com/openjournals/joss-reviews/issues/6017 + + + + 10.21105/joss.06017 + https://joss.theoj.org/papers/10.21105/joss.06017 + + + https://joss.theoj.org/papers/10.21105/joss.06017.pdf + + + + + + SLURM Dashboard + Dessalvi + 2021 + Dessalvi, M. (2021). SLURM Dashboard. +https://grafana.com/grafana/dashboards/4323. + + + R: A language and environment for statistical +computing + R Core Team + 2023 + R Core Team. (2023). R: A language +and environment for statistical computing. R Foundation for Statistical +Computing. 
https://www.R-project.org/ + + + Shiny: Web application framework for +r + Chang + 2023 + Chang, W., Cheng, J., Allaire, J., +Sievert, C., Schloerke, B., Xie, Y., Allen, J., McPherson, J., Dipert, +A., & Borges, B. (2023). Shiny: Web application framework for +r. + + + Shinydashboard: Create dashboards with +’shiny’ + Chang + 2021 + Chang, W., & Borges Ribeiro, B. +(2021). Shinydashboard: Create dashboards with ’shiny’. +http://rstudio.github.io/shinydashboard/ + + + SLURM: Simple linux utility for resource +management + Yoo + Job scheduling strategies for parallel +processing + 10.1007/10968987_3 + 978-3-540-39727-4 + 2003 + Yoo, A. B., Jette, M. A., & +Grondona, M. (2003). SLURM: Simple linux utility for resource +management. In D. Feitelson, L. Rudolph, & U. Schwiegelshohn (Eds.), +Job scheduling strategies for parallel processing (pp. 44–60). Springer +Berlin Heidelberg. +https://doi.org/10.1007/10968987_3 + + + Open XDMoD: A tool for the comprehensive +management of high-performance computing resources + Palmer + Computing in Science & +Engineering + 4 + 17 + 10.1109/MCSE.2015.68 + 2015 + Palmer, J. T., Gallo, S. M., Furlani, +T. R., Jones, M. D., DeLeon, R. L., White, J. P., Simakov, N., Patra, A. +K., Sperhac, J., Yearke, T., Rathsam, R., Innus, M., Cornelius, C. D., +Browne, J. C., Barth, W. L., & Evans, R. T. (2015). Open XDMoD: A +tool for the comprehensive management of high-performance computing +resources. Computing in Science & Engineering, 17(4), 52–62. +https://doi.org/10.1109/MCSE.2015.68 + + + Open OnDemand: A web-based client portal for +HPC centers + Hudak + Journal of Open Source +Software + 25 + 3 + 10.21105/joss.00622 + 2018 + Hudak, D., Johnson, D., Chalker, A., +Nicklas, J., Franz, E., Dockendorf, T., & McMichael, B. L. (2018). +Open OnDemand: A web-based client portal for HPC centers. Journal of +Open Source Software, 3(25), 622. 
+https://doi.org/10.21105/joss.00622 + + + + + + diff --git a/joss.06017/10.21105.joss.06017.pdf b/joss.06017/10.21105.joss.06017.pdf new file mode 100644 index 0000000000..82f7dac43e Binary files /dev/null and b/joss.06017/10.21105.joss.06017.pdf differ diff --git a/joss.06017/paper.jats/10.21105.joss.06017.jats b/joss.06017/paper.jats/10.21105.joss.06017.jats new file mode 100644 index 0000000000..e129548162 --- /dev/null +++ b/joss.06017/paper.jats/10.21105.joss.06017.jats @@ -0,0 +1,513 @@ + + +
+ + + + +Journal of Open Source Software +JOSS + +2475-9066 + +Open Journals + + + +6017 +10.21105/joss.06017 + +SCAS dashboard: A tool to intuitively and interactively +analyze Slurm cluster usage + + + +https://orcid.org/0009-0009-3995-709X + +Walzthoeni +Thomas + + +* + + + +Singiali +Bom Bahadur + + + + +https://orcid.org/0000-0003-0510-4792 + +Rayner +N. William + + + + + +Casale +Francesco Paolo + + + + + + +https://orcid.org/0000-0002-0772-7267 + +Feest +Christoph + + + + + +https://orcid.org/0000-0003-2154-4552 + +Marr +Carsten + + + + +https://orcid.org/0000-0002-7736-3059 + +Wachsmann +Alf + + + + + +Core Facility Genomics, Helmholtz Zentrum München - German +Research Center for Environmental Health, 85764 Neuherberg, +Germany + + + + +Digital Transformation & IT, Helmholtz Munich, +Helmholtz Zentrum München - German Research Center for Environmental +Health, 85764 Neuherberg, Germany + + + + +Institute of Translational Genomics, Helmholtz Zentrum +München - German Research Center for Environmental Health, 85764 +Neuherberg, Germany + + + + +Computational Health Center, Helmholtz Zentrum München - +German Research Center for Environmental Health, 85764 Neuherberg, +Germany + + + + +Helmholtz Pioneer Campus, Helmholtz Zentrum München - +German Research Center for Environmental Health, 85764 Neuherberg, +Germany + + + + +School of Computation, Information and Technology, +Technical University of Munich, Munich, Germany + + + + +Helmholtz AI, Helmholtz Zentrum München - German Research +Center for Environmental Health, 85764 Neuherberg, Germany + + + + +* E-mail: + + +30 +8 +2023 + +9 +97 +6017 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2022 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +Slurm +HPC +dashboard +python +R +shiny +containers + + + + + + 
Summary +

Many organizations offer High Performance Computing (HPC) + environments as a service, hosted on-premises or in the cloud. Compute + jobs are commonly managed via Slurm + (Yoo et + al., 2003), but an intuitive, easy-to-use and interactive + visualization has been lacking. To fill this gap, we developed a Slurm + Cluster Admin Statistics (SCAS) dashboard. SCAS provides a means to + analyze and visualize data of compute jobs and includes a feature to + generate presentations for cluster users. It thus allows HPC + stakeholders to easily analyze and identify bottlenecks of Slurm-based + compute clusters in a timely fashion and provides decision-making + support for managing cluster resources.

+
+ + Statement of need +

Slurm + (Yoo et + al., 2003) is an open-source cluster management and job + scheduling system for Linux-based compute clusters and is widely used + for High Performance Computing (HPC). It offers command-line tools to + export and analyze cluster usage, and various applications have been + developed to monitor the current state of the cluster (e.g., live + dashboards using Grafana + (Dessalvi, + 2021)). A feature-rich tool for the analysis of cluster + performance is Open XDMoD + (Palmer + et al., 2015), which supports various schedulers and metrics. + Open XDMoD uses third-party software libraries that are not free for + commercial use. Open OnDemand + (Hudak + et al., 2018) allows users to access an HPC cluster through a web + portal; it provides various apps to facilitate HPC usage and can + integrate Open XDMoD for usage statistics. Both Open XDMoD and + Open OnDemand require continuous support and extensive configuration. + Intuitive, responsive, easy-to-install and easy-to-use + applications that enable HPC administrators and managers to analyze + and visualize cluster usage in detail and over time are therefore highly + complementary. Such usage information is crucial for identifying bottlenecks in + compute clusters and for making informed strategic decisions regarding their + future development.

+

To address this need, we developed the Slurm Cluster Admin Statistics + (SCAS) dashboard, a scalable and flexible dashboard application to + analyze completed compute jobs on a Slurm-based cluster. The dashboard + offers various statistics, visualizations, and insights to HPC + stakeholders and cluster users. Additionally, we engineered the + software to have a low memory footprint and to be fast and responsive + to user queries. The software stack is provided in an easy-to-use and + easy-to-deploy manner using Docker containers and a docker-compose + implementation.

+
+ + Description + + SCAS Dashboard overview +

The SCAS dashboard architecture consists of an nginx web server as + a router (reverse proxy), a front end based on R-Shiny + (Chang + et al., 2023; + Chang + & Borges Ribeiro, 2021; + R Core + Team, 2023), a back end based on Python using the Django REST + framework to provide an API, and a PostgreSQL database for storage + (see [fig:fig1]). + The dashboard is intended for HPC stakeholders and therefore + includes secure user authentication. The front end is a + user-friendly interface for filtering and visualizing the Slurm + data. The back end provides an admin interface via Python Django + Admin and a web API that is used by both the front end and a script + for uploading new data. Additionally, the back end creates a daily + index of the data, enabling the software to maintain a low memory + footprint while being fast and responsive. Furthermore, a + presentation can be generated automatically and viewed by various + stakeholders, including the cluster users, via a web browser.

+ +

Architecture of the SCAS dashboard. The dashboard and + the presentation are accessed by the user through a web browser. + New data can be uploaded to the SCAS dashboard by executing a + script that regularly fetches the latest data from a job + submission node. On the server side, the architecture is organized + into separate components (shown in dashed box): nginx (reverse + proxy), SCAS-frontend, SCAS-backend, and PostgreSQL database. A + docker-compose implementation of the services is provided. +

+ +
+
+ + SCAS dashboard workflow +

Completed compute jobs and available node configurations are + submitted to the SCAS-backend API with a script that uses + Slurm’s sacct tool. This script can be run as a + daily or weekly cron job on a job submission node. + The back end then generates the daily statistics that are stored in + the database. This preprocessed, indexed data enables the app to have + a low memory footprint and high responsiveness, as no calculations + are required when the data is fetched from the API. Upon filtering a + date range in the front end, a request is sent to the back end, which + retrieves the data for the selected days and aggregates the + statistics to generate the visualizations.
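The workflow above can be illustrated with a minimal Python sketch of the daily-index idea: job records are reduced to per-day totals once, at upload time, and a date-range query then only combines those stored totals. This is not the SCAS implementation. The record keys (`submit`, `start`, `end`, `cpus`), the `DailyStats` fields, and both function names are assumptions made for illustration, as is the choice to assign each job to the day on which it ended.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class DailyStats:
    """Pre-aggregated totals for one calendar day (hypothetical schema)."""
    jobs: int = 0
    cpu_hours: float = 0.0
    pending_seconds: float = 0.0


def build_daily_index(jobs):
    """Reduce raw job records to per-day totals, once, at upload time."""
    index = defaultdict(DailyStats)
    for job in jobs:
        submit = datetime.fromisoformat(job["submit"])
        start = datetime.fromisoformat(job["start"])
        end = datetime.fromisoformat(job["end"])
        day = index[end.date()]  # assumption: a job counts toward its end day
        day.jobs += 1
        day.cpu_hours += job["cpus"] * (end - start).total_seconds() / 3600
        day.pending_seconds += (start - submit).total_seconds()
    return dict(index)


def summarize_range(index, first, last):
    """Combine the stored per-day totals for first..last (inclusive)."""
    days = [stats for day, stats in index.items() if first <= day <= last]
    jobs = sum(stats.jobs for stats in days)
    pending = sum(stats.pending_seconds for stats in days)
    return {
        "jobs": jobs,
        "cpu_hours": sum(stats.cpu_hours for stats in days),
        "mean_pending_seconds": pending / jobs if jobs else 0.0,
    }
```

Because the totals are stored as sums and counts rather than as averages, any date range can be summarized exactly without revisiting individual job records.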

+
+ + Frontend – dashboard user interface +

[fig:fig2] + displays some example views of the user interface. The date range, + the cluster, and the partitions that should be analyzed can be + selected from the menu + (Figure 2a). Data + tables and visualizations are then updated accordingly and displayed + to the user.

+

For the selected date range, the visualizations include:

+ + +

Number of jobs, CPU and GPU hours per month + (Figure + 2b,c)

+
+ +

Memory and cores requested by users, displayed as contingency + graphs

+
+ +

Average job pending times and runtimes per month and per day + (Figure + 2d,e,f)

+
+ +

Distribution of CPU hours used vs. the percentage of + users

+
+ +

Total cluster utilization per day and per month, individual + node utilization per month, summaries of utilization per CPU/GPU + or memory type of the nodes per month + (Figure 2g)

+
+
+

The data can also be downloaded for use in spreadsheet + applications.

+
+ + Frontend – automated presentation +

For presenting key figures to the cluster users, a feature is + available to generate a browser-based presentation in carousel mode. + The presentation is auto-updated and customization settings are + available via the admin interface.

+
+ + SCAS dashboard - example use case +

To illustrate an analysis with the SCAS dashboard, we assume that + users have reported longer pending times for GPU resources in recent + months. We simulated this case by increasing the number of GPU + jobs (and their pending times) on GPU servers with 16 GPUs over a + time frame of one year. As shown in + Figure 2b,c, the + increase in the number of GPU jobs and CPU hours for the GPU + partition is visible and confirms the assumption. Inspecting the + pending times per day + (Figure 2d) shows + a general, unbiased increase in pending times over the + last few months. + Figure 2e + then shows an increase in pending times for the GPU partition over + the previous 6 months. + Figure 2f shows that + the increase in pending times occurs only for servers with + >10 GPUs, and the utilization of the nodes with 16 GPUs has + increased while that of nodes with 2 and 4 GPUs remained stable + (Figure 2g). This + analysis can be used to draw concrete conclusions: in this case, + either to inform the users that resources are available if up to 4 GPUs + are requested, or to decide to invest in new GPU servers + to achieve shorter pending times and higher throughput.
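The grouping behind a per-category pending-time view such as the one described for Figure 2f can be sketched as follows. This is an illustrative Python sketch, not code from the SCAS repository: the record keys (`month`, `node_gpus`, `pending_hours`), the function names, and the use of the population standard deviation are assumptions. Only the split at more than 10 GPUs is taken from the use case above.

```python
from collections import defaultdict
from statistics import mean, pstdev


def gpu_category(node_gpus, threshold=10):
    """Label a node by GPU count; the >10 split mirrors the example use case."""
    if node_gpus > threshold:
        return f">{threshold} GPUs"
    return f"<={threshold} GPUs"


def pending_by_month_and_category(jobs):
    """Return {(month, category): (mean, standard deviation)} of pending hours."""
    groups = defaultdict(list)
    for job in jobs:
        key = (job["month"], gpu_category(job["node_gpus"]))
        groups[key].append(job["pending_hours"])
    # Population standard deviation is an assumption; it is defined even
    # for groups that contain a single job.
    return {key: (mean(vals), pstdev(vals)) for key, vals in sorted(groups.items())}
```

A rising mean for the `>10 GPUs` group alongside a flat mean for the other group would correspond to the pattern described in the use case.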

+ +

a. User interface of the SCAS dashboard + featuring navigation and the selection menu. The central panel + displays statistics and graphics. b. Line plot + showing the jobs run per month. c. Line plot showing + the GPU hours per month. d. Heatmap plot showing the + average daily pending times of the jobs. e. Line plot + with the average job pending times per month. The positive error + bars indicate the standard deviation. f. Line plot + with the average job pending times per month separated by GPU + categories. The positive error bars indicate the standard + deviation. g. Line plot showing the utilization of + nodes with different numbers of GPUs. +

+ +
+
+
+ + Conclusion and Availability +

The SCAS dashboard enables rapid and responsive analysis of + Slurm-based cluster usage. This allows stakeholders (I) to identify + current bottlenecks of CPU and GPU utilization, (II) to make informed + decisions to adapt Slurm parameters in the short term, and (III) to + support strategic decisions, all based on user needs. The SCAS + dashboard, code, and documentation are hosted in a publicly + available GitHub repository + (https://github.com/Bioinformatics-Munich/scas_dashboard). + The repository also contains a docker-compose file for rapid + deployment and testing of the software, as well as a program to + generate test data for the dashboard.

+
+ + Acknowledgements +

We acknowledge the Institute of Computational Biology + (Prof. Dr. Dr. Fabian Theis) at Helmholtz Munich for supporting the + development of the software. We thank Dr. Bastian Rieck, Helmholtz + Munich, for valuable contributions to and comments on the manuscript.

+
+ + + + + + + DessalviMatteo + + SLURM Dashboard + https://grafana.com/grafana/dashboards/4323 + 2021 + + + + + + R Core Team + + R: A language and environment for statistical computing + R Foundation for Statistical Computing + Vienna, Austria + 2023 + https://www.R-project.org/ + + + + + + ChangWinston + ChengJoe + AllaireJJ + SievertCarson + SchloerkeBarret + XieYihui + AllenJeff + McPhersonJonathan + DipertAlan + BorgesBarbara + + Shiny: Web application framework for r + 2023 + + + + + + ChangWinston + Borges RibeiroBarbara + + Shinydashboard: Create dashboards with ’shiny’ + 2021 + http://rstudio.github.io/shinydashboard/ + + + + + + YooAndy B. + JetteMorris A. + GrondonaMark + + SLURM: Simple linux utility for resource management + Job scheduling strategies for parallel processing + + FeitelsonDror + RudolphLarry + SchwiegelshohnUwe + + Springer Berlin Heidelberg + Berlin, Heidelberg + 2003 + 978-3-540-39727-4 + 10.1007/10968987_3 + 44 + 60 + + + + + + PalmerJeffrey T. + GalloSteven M. + FurlaniThomas R. + JonesMatthew D. + DeLeonRobert L. + WhiteJoseph P. + SimakovNikolay + PatraAbani K. + SperhacJeanette + YearkeThomas + RathsamRyan + InnusMartins + CorneliusCynthia D. + BrowneJames C. + BarthWilliam L. + EvansRichard T. + + Open XDMoD: A tool for the comprehensive management of high-performance computing resources + Computing in Science & Engineering + 2015 + 17 + 4 + 10.1109/MCSE.2015.68 + 52 + 62 + + + + + + HudakDave + JohnsonDoug + ChalkerAlan + NicklasJeremy + FranzEric + DockendorfTrey + McMichaelBrian L. + + Open OnDemand: A web-based client portal for HPC centers + Journal of Open Source Software + The Open Journal + 2018 + 3 + 25 + https://doi.org/10.21105/joss.00622 + 10.21105/joss.00622 + 622 + + + + + +
diff --git a/joss.06017/paper.jats/Figure1.png b/joss.06017/paper.jats/Figure1.png new file mode 100644 index 0000000000..19cc2f7a5f Binary files /dev/null and b/joss.06017/paper.jats/Figure1.png differ diff --git a/joss.06017/paper.jats/Figure2.png b/joss.06017/paper.jats/Figure2.png new file mode 100644 index 0000000000..13d3ea1d88 Binary files /dev/null and b/joss.06017/paper.jats/Figure2.png differ