
feat: measuring compute efficiency per job #221


Open

wants to merge 108 commits into base: main
Commits (108)
b4235fb
feat: writing error to stderr
cmeesters Jan 22, 2025
27b8610
feat: added code to report job efficiency - to be tested
cmeesters Mar 9, 2025
7e6b369
feat: calling the efficiency evaluator - needs to be tested
cmeesters Mar 10, 2025
ed1beaf
docs: rework contributions section
dlaehnemann Mar 13, 2025
c50d28e
docs: some initial restructuring
dlaehnemann Mar 13, 2025
7c0e434
docs: first step of reworking resource specification
dlaehnemann Mar 14, 2025
d947665
docs: more resource specification cleanup
dlaehnemann Mar 14, 2025
bd3d900
docs: move example to the front (not yet updated, but with TODO)
dlaehnemann Mar 14, 2025
15f7892
docs: move retries to configurations section
dlaehnemann Mar 14, 2025
f61ab7c
Merge branch 'main' into docs/review-new-docs
cmeesters Mar 14, 2025
f55f8b0
fix: merge conflict
cmeesters Mar 14, 2025
2861539
Merge branch 'main' into docs/review-new-docs
cmeesters Mar 15, 2025
474ead0
docs: minor re-wording job types
dlaehnemann Mar 17, 2025
6da5674
Merge branch 'docs/review-new-docs' of github.com:snakemake/snakemake…
dlaehnemann Mar 17, 2025
7a51cf9
docs: linkouts and wording in configuration sections (where, dynamic)
dlaehnemann Mar 24, 2025
0a3c17b
docs: slurm_partition linkout and wording
dlaehnemann Mar 24, 2025
9c8ffc0
docs: add gpu to standard resources table
dlaehnemann Mar 24, 2025
8faafba
docs: rework of most slurm-specific resources
dlaehnemann Mar 24, 2025
b78baeb
docs: initial wait times and frequencies changes
dlaehnemann Mar 24, 2025
7f49ee5
Merge branch 'main' into docs/review-new-docs
cmeesters Mar 28, 2025
c53937c
Merge branch 'main' into docs/review-new-docs
cmeesters Mar 31, 2025
b3f80d8
Merge branch 'main' into docs/review-new-docs
cmeesters Apr 4, 2025
9f8ed3e
docs: further clarfiy where to set resources and configurations
dlaehnemann Apr 11, 2025
dba70d8
docs: update wait times and frequencies section (still contains TODOs)
dlaehnemann Apr 11, 2025
d69a0b1
docs: rework "retry failed jobs" section as fas as possible, leaving …
dlaehnemann Apr 11, 2025
be45e7b
docs: remove example and explanation section for resource configurati…
dlaehnemann Apr 11, 2025
bf18d2b
docs: remove explanation of regular SMP jobs, as this does not seem u…
dlaehnemann Apr 11, 2025
8f5de00
docs: big TODO for MPI config
dlaehnemann Apr 11, 2025
bb4c85e
Merge branch 'main' into docs/review-new-docs
cmeesters Apr 11, 2025
bc1f024
fix: corrected description for 'tasks'
cmeesters Apr 28, 2025
8d0f20f
docs: added referal to local rules main docs
cmeesters Apr 28, 2025
11862ed
fix: updating and restoring previous code base / merge conflicts
cmeesters May 5, 2025
4b671e3
feat: added helper function to calculate efficiency report and colori…
cmeesters May 5, 2025
9361226
Merge branch 'main' into docs/review-new-docs
cmeesters May 7, 2025
3cac42c
Merge branch 'main' into feat/eff-report
cmeesters May 8, 2025
317a2af
fix: formatting
cmeesters May 8, 2025
f59d03d
feat: allowing for multi argument messages in the colorization string
cmeesters May 8, 2025
641beb3
fix: formatting
cmeesters May 8, 2025
8baba42
Merge branch 'main' into docs/review-new-docs
cmeesters May 22, 2025
92c3531
feat: using the latest announcement bot action version
cmeesters May 22, 2025
8321d29
fix: removing todo item on '--executor slurm', adding the flag to the…
cmeesters May 22, 2025
5c21a1b
fix: capitalizing all nouns in header lines for consistency
cmeesters May 22, 2025
5686272
fix: removed TODO item in the MPI section. Yes, the exectuor is has n…
cmeesters May 22, 2025
8007365
fix: removed TODO items from the requeu section - hopefully a little …
cmeesters May 22, 2025
4cfac88
fix: reduced TODO items from the status check section: see note in th…
cmeesters May 22, 2025
aca969c
fix: removed additional quotes from slurm_extra example - not necessa…
cmeesters May 22, 2025
7cf6c86
feat: added section on interactive jobs
cmeesters May 22, 2025
bfbc3a6
fix: typo, missing line break
cmeesters May 22, 2025
545f00d
fix: added missing colum with the --gpus flag
cmeesters May 22, 2025
2cc48a3
Merge branch 'docs/review-new-docs' into feat/eff-report
cmeesters May 22, 2025
c4a1385
fix: update for announcement robot
cmeesters May 25, 2025
dd6e897
fix: announcement workflow working with latest version
cmeesters May 26, 2025
a9f1ceb
fix: efficiency report NOT written to slurm logdir - was a very bad idea
cmeesters May 26, 2025
c38d11e
fix: renamed function and passing sacct command through shlex
cmeesters May 26, 2025
548eba6
Merge branch 'main' into feat/eff-report
cmeesters May 26, 2025
acee2a6
feat: protecting against zero division
cmeesters May 26, 2025
a2ce419
feat: consistent time parsing
cmeesters May 26, 2025
212143d
fix: correct memory conversion for multi node jobs
cmeesters May 26, 2025
1f31fc0
fix: blacked
cmeesters May 26, 2025
22f8715
Merge branch 'feat/eff-report' of github.com:snakemake/snakemake-exec…
cmeesters May 26, 2025
a48eea5
fix: merge conflict
cmeesters May 26, 2025
22c3244
docs: added documentation for this new feature
cmeesters May 26, 2025
4ff426f
test: attempt to add test case
cmeesters May 26, 2025
a3a1f34
fix: blacked
cmeesters May 26, 2025
a265e8c
fix: attempt to install pandas for testing
cmeesters May 26, 2025
bd4c173
fix: added missing pandas dependency
cmeesters May 30, 2025
1e06187
fix: stupid mistake - pandas is not part of the ci workflow, but a de…
cmeesters May 30, 2025
266aa41
fix: removed silly run statement
cmeesters May 30, 2025
05019fb
fix: modified test
cmeesters May 30, 2025
135c47a
fix: blacked
cmeesters May 30, 2025
af84f85
fix: removed unused import of glob
cmeesters May 30, 2025
ebeb08b
fix: now using simple assert
cmeesters May 30, 2025
2d1b041
fix: another attempt to get formatting and linting right
cmeesters May 30, 2025
bf3e88f
gnarf: no idea what went wrong
cmeesters May 30, 2025
082e01a
fix: removed explainatory statement at assertion
cmeesters May 30, 2025
4e99e32
fix: attempt to overwrite class method
cmeesters May 30, 2025
67b0671
fix: last attempt to fix the ci - pathlib search everywhere
cmeesters May 30, 2025
a1c5ba2
fix: removed unused imports
cmeesters May 30, 2025
86ff31a
fix: outcommented check for log file path
cmeesters May 30, 2025
0cb4b7f
fix: removed unused re import
cmeesters May 30, 2025
b433010
fix: removed useless line
cmeesters May 30, 2025
efd6398
fix: attempt to check more dirs
cmeesters May 30, 2025
7267dab
fix: typo
cmeesters May 30, 2025
2c334ad
fix: log -> csv suffix; added printing of cwd (temporary)
cmeesters Jun 2, 2025
a26e395
fix: refactored - got rid of own colorizing methods
cmeesters Jun 2, 2025
02a855d
fix: reformatted
cmeesters Jun 2, 2025
ee59544
fix: removed unnecessary colorizing method
cmeesters Jun 2, 2025
db9116a
Merge branch 'feat/eff-report' of github.com:snakemake/snakemake-exec…
cmeesters Jun 2, 2025
a0a0618
debug: merely inserted printf debug statements, will be removed
cmeesters Jun 2, 2025
96551c6
fix: first step to customizable logdir - better debugging
cmeesters Jun 3, 2025
7c8e3d6
fix: accounting for changed suffix
cmeesters Jun 3, 2025
4167460
fix: typos
cmeesters Jun 3, 2025
07d663a
fix: syntax
cmeesters Jun 3, 2025
55e7d58
fix: gnaaarf
cmeesters Jun 3, 2025
2dbc6a8
fix: formatting
cmeesters Jun 3, 2025
aadbf8c
fix: apparently snakemake cli not available
cmeesters Jun 3, 2025
820a9db
fix: formatting ...
cmeesters Jun 3, 2025
b56bf06
fix: cleanup
cmeesters Jun 3, 2025
2b72aa4
fix: should work ...
cmeesters Jun 3, 2025
8b2e100
fix: missing / superfluous imports
cmeesters Jun 3, 2025
fb826d1
fix: attempting old fashioned path walk
cmeesters Jun 3, 2025
930b70e
fix: _ for dirs in path walk
cmeesters Jun 3, 2025
cb3db0c
feat: customizable efficiency log path and threshold
cmeesters Jun 3, 2025
91815df
fix: attempting with new efficiency report path
cmeesters Jun 3, 2025
f983322
fix: gnarf - added pathlib import#
cmeesters Jun 3, 2025
66f0dc3
fix: own path did not work
cmeesters Jun 3, 2025
602db18
fix: included tmp_path output to visualize difference
cmeesters Jun 3, 2025
b7c80ef
fix: pleasing the linter
cmeesters Jun 3, 2025
2 changes: 1 addition & 1 deletion .github/workflows/announce-release.yml
@@ -16,7 +16,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Post to Mastodon
uses: snakemake/[email protected].0
uses: snakemake/[email protected].1
with:
access-token: ${{ secrets.MASTODONBOT }}
pr-title: ${{ github.event.head_commit.message }}
4 changes: 4 additions & 0 deletions docs/further.md
@@ -618,6 +618,10 @@ This configuration directs SLURM logs to a centralized location, making them eas
Running Snakemake within an active SLURM job can lead to unpredictable behavior, as the execution environment may not be properly configured for job submission.
To mitigate potential issues, the SLURM executor plugin detects when it's operating inside a SLURM job and issues a warning, pausing for 5 seconds before proceeding.
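As a rough illustration only (not the plugin's exact implementation), this detection can be thought of as a check for the `SLURM_JOB_ID` environment variable that SLURM sets inside every job:

```python
# Rough sketch, assuming detection via the SLURM_JOB_ID environment variable;
# the plugin's actual check may differ in detail.
import os
import time

job_id = os.environ.get("SLURM_JOB_ID")
if job_id:
    print(
        "Warning: Snakemake appears to be running inside SLURM job "
        f"{job_id}; pausing for 5 seconds before submitting jobs."
    )
    time.sleep(5)
```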

### Getting Job Efficiency Information

With `--slurm-efficiency-report` you can generate a table of efficiency data for all jobs of a workflow. A report file `efficiency_report_<workflow_id>.csv` will be written to your current working directory. This is equivalent to the information provided by `seff <jobid>` for individual jobs. It works best if job "comments" are stored as a job property on your cluster, as this plugin uses the "comment" parameter to store the rule name.
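For example, an invocation along the lines of `snakemake --executor slurm --slurm-efficiency-report ...` (with the rest of your usual workflow arguments) should produce the report once the workflow has finished.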

### Frequently Asked Questions

#### Should I run Snakemake on the Login Node of my Cluster?
1 change: 1 addition & 0 deletions pyproject.toml
@@ -26,6 +26,7 @@ flake8 = "^6.1.0"
coverage = "^7.3.1"
pytest = "^8.3.5"
snakemake = "^9.4.0"
pandas = "^2.2.3"

[tool.coverage.run]
omit = [".*", "*/site-packages/*", "Snakefile"]
202 changes: 192 additions & 10 deletions snakemake_executor_plugin_slurm/__init__.py
@@ -3,7 +3,6 @@
__email__ = "[email protected]"
__license__ = "MIT"

import atexit
import csv
from io import StringIO
import os
@@ -16,6 +15,9 @@
from datetime import datetime, timedelta
from typing import List, Generator, Optional
import uuid

import pandas as pd

from snakemake_interface_executor_plugins.executors.base import SubmittedJobInfo
from snakemake_interface_executor_plugins.executors.remote import RemoteExecutor
from snakemake_interface_executor_plugins.settings import (
@@ -27,7 +29,12 @@
)
from snakemake_interface_common.exceptions import WorkflowError

from .utils import delete_slurm_environment, delete_empty_dirs, set_gres_string
from .utils import (
delete_slurm_environment,
delete_empty_dirs,
set_gres_string,
)
from .efficiency_report import time_to_seconds, parse_maxrss, parse_reqmem
from .submit_string import get_submit_command


@@ -106,6 +113,37 @@ class ExecutorSettings(ExecutorSettingsBase):
"required": False,
},
)
efficiency_report: bool = field(
default=False,
metadata={
"help": "Generate an efficiency report at the end of the workflow. "
"This flag has no effect, if not set.",
"env_var": False,
"required": False,
},
)
efficiency_report_path: Optional[Path] = field(
default=None,
metadata={
"help": "Path to the efficiency report file. "
"If not set, the report will be written to "
"the current working directory with the name "
"'efficiency_report_<run_uuid>.csv'. "
"This flag has no effect, if not set.",
"env_var": False,
"required": False,
},
)
efficiency_threshold: Optional[float] = field(
default=0.8,
metadata={
"help": "The efficiency threshold for the efficiency report. "
"Jobs with an efficiency below this threshold will be reported. "
"This flag has no effect, if not set.",
"env_var": False,
"required": False,
},
)


# Required:
@@ -149,7 +187,21 @@ def __post_init__(self, test_mode: bool = False):
if self.workflow.executor_settings.logdir
else Path(".snakemake/slurm_logs").resolve()
)
atexit.register(self.clean_old_logs)

def shutdown(self) -> None:
"""
Shutdown the executor.
This method is overridden to clean up old log files
and, optionally, to create an efficiency report.
"""
# First, we invoke the original shutdown method
super().shutdown()

# Next, clean up old log files, unconditionally.
self.clean_old_logs()
# If the efficiency report is enabled, create it.
if self.workflow.executor_settings.efficiency_report:
self.create_efficiency_report()

def clean_old_logs(self) -> None:
"""Delete files older than specified age from the SLURM log directory."""
@@ -160,20 +212,23 @@ def clean_old_logs(self) -> None:
return
cutoff_secs = age_cutoff * 86400
current_time = time.time()
self.logger.info(f"Cleaning up log files older than {age_cutoff} day(s)")
self.logger.info(f"Cleaning up log files older than {age_cutoff} day(s).")

for path in self.slurm_logdir.rglob("*.log"):
if path.is_file():
try:
file_age = current_time - path.stat().st_mtime
if file_age > cutoff_secs:
path.unlink()
except (OSError, FileNotFoundError) as e:
self.logger.warning(f"Could not delete logfile {path}: {e}")
self.logger.error(f"Could not delete logfile {path}: {e}")
# we need a 2nd iteration to remove putatively empty directories
try:
delete_empty_dirs(self.slurm_logdir)
except (OSError, FileNotFoundError) as e:
self.logger.warning(f"Could not delete empty directory {path}: {e}")
self.logger.error(
f"Could not delete empty directories in {self.slurm_logdir}: {e}"
)

def warn_on_jobcontext(self, done=None):
if not done:
@@ -730,10 +785,137 @@ def check_slurm_extra(self, job):
jobname = re.compile(r"--job-name[=?|\s+]|-J\s?")
if re.search(jobname, job.resources.slurm_extra):
raise WorkflowError(
"The --job-name option is not allowed in the 'slurm_extra' "
"parameter. The job name is set by snakemake and must not be "
"overwritten. It is internally used to check the stati of the "
"all submitted jobs by this workflow."
"The --job-name option is not allowed in the 'slurm_extra' parameter. "
"The job name is set by snakemake and must not be overwritten. "
"It is internally used to check the stati of the all submitted jobs "
"by this workflow."
"Please consult the documentation if you are unsure how to "
"query the status of your jobs."
)

def create_efficiency_report(self):
"""
Fetch sacct job data for a Snakemake workflow
and compute efficiency metrics.
"""
cmd = f"sacct --name={self.run_uuid} --parsable2 --noheader"
cmd += (
" --format=JobID,JobName,Comment,Elapsed,TotalCPU,"
"NNodes,NCPUS,MaxRSS,ReqMem"
)
e_threshold = self.workflow.executor_settings.efficiency_threshold

try:
result = subprocess.run(
shlex.split(cmd), capture_output=True, text=True, check=True
)
lines = result.stdout.strip().split("\n")
except subprocess.CalledProcessError:
self.logger.error(
f"Failed to retrieve job data for workflow {self.run_uuid}."
)
return None

# Convert to DataFrame
df = pd.DataFrame(
(line.split("|") for line in lines),
columns=[
"JobID",
"JobName",
"Comment",
"Elapsed",
"TotalCPU",
"NNodes",
"NCPUS",
"MaxRSS",
"ReqMem",
],
)

# If the "Comment" column is empty,
# a) delete the column
# b) issue a warning
if df["Comment"].isnull().all():
self.logger.warning(
f"No comments found for workflow {self.run_uuid}. "
"This field is used to store the rule name. "
"Please ensure that the 'comment' field is set for your cluster. "
"Administrators can set this up in the SLURM configuration."
)
df.drop(columns=["Comment"], inplace=True)
# remember, that the comment column is not available
nocomment = True
# else: rename the column to 'RuleName'
else:
df.rename(columns={"Comment": "RuleName"}, inplace=True)
nocomment = False
# Convert types
df["NNodes"] = pd.to_numeric(df["NNodes"], errors="coerce")
df["NCPUS"] = pd.to_numeric(df["NCPUS"], errors="coerce")

# Convert time fields
df["Elapsed_sec"] = df["Elapsed"].apply(time_to_seconds)
df["TotalCPU_sec"] = df["TotalCPU"].apply(time_to_seconds)

# Compute CPU efficiency
df["CPU Efficiency (%)"] = (
df["TotalCPU_sec"] / (df["Elapsed_sec"] * df["NCPUS"])
) * 100
df["CPU Efficiency (%)"] = df["CPU Efficiency (%)"].fillna(0).round(2)

# Convert MaxRSS
df["MaxRSS_MB"] = df["MaxRSS"].apply(parse_maxrss)

# Convert ReqMem and calculate memory efficiency
df["RequestedMem_MB"] = df.apply(
lambda row: parse_reqmem(row["ReqMem"], row["NNodes"]), axis=1
)
df["Memory Usage (%)"] = df.apply(
lambda row: (
(row["MaxRSS_MB"] / row["RequestedMem_MB"] * 100)
if row["RequestedMem_MB"] > 0
else 0
),
axis=1,
)

df["Memory Usage (%)"] = df["Memory Usage (%)"].fillna(0).round(2)

# Drop all rows containing "batch" or "extern" as job names
df = df[~df["JobName"].str.contains("batch|extern")]

# Log warnings for low efficiency
for _, row in df.iterrows():
if row["CPU Efficiency (%)"] < e_threshold:
if nocomment:
self.logger.warning(
f"Job {row['JobID']} ({row['JobName']}) "
f"has low CPU efficiency: {row['CPU Efficiency (%)']}%."
)
else:
# if the comment column is available, we can use it to
# identify the rule name
self.logger.warning(
f"Job {row['JobID']} for rule '{row['RuleName']}' "
f"({row['JobName']}) has low CPU efficiency: "
f"{row['CPU Efficiency (%)']}%."
)

# construct the output path, allowing for a custom
# report directory, if specified

# Save the report to a CSV file
logfile = f"efficiency_report_{self.run_uuid}.csv"
if self.workflow.executor_settings.efficiency_report_path:
logfile = (
Path(self.workflow.executor_settings.efficiency_report_path) / logfile
)
else:
logfile = Path.cwd() / logfile
df.to_csv(logfile)

# write out the efficiency report at normal verbosity in any case
self.logger.info(
f"Efficiency report for workflow {self.run_uuid} saved to {logfile}."
)
50 changes: 50 additions & 0 deletions snakemake_executor_plugin_slurm/efficiency_report.py
@@ -0,0 +1,50 @@
import re
import pandas as pd


def time_to_seconds(time_str):
"""Convert SLURM time format to seconds."""
if pd.isna(time_str) or time_str.strip() == "":
return 0
parts = time_str.split(":")

if len(parts) == 3: # H:M:S
return int(parts[0]) * 3600 + int(parts[1]) * 60 + float(parts[2])
elif len(parts) == 2: # M:S
return int(parts[0]) * 60 + float(parts[1])
elif len(parts) == 1: # S
return float(parts[0])
return 0


def parse_maxrss(maxrss):
"""Convert MaxRSS to MB."""
if pd.isna(maxrss) or maxrss.strip() == "" or maxrss == "0":
return 0
match = re.match(r"(\d+)([KMG]?)", maxrss)
if match:
value, unit = match.groups()
value = int(value)
unit_multipliers = {"K": 1 / 1024, "M": 1, "G": 1024}
return value * unit_multipliers.get(unit, 1)
return 0


def parse_reqmem(reqmem, number_of_nodes=1):
"""Convert requested memory to MB."""
if pd.isna(reqmem) or reqmem.strip() == "":
return 0
match = re.match(
r"(\d+)([KMG])?(\S+)?", reqmem
) # Handles "4000M" or "4G" or "2G/node"
if match:
value, unit, per_unit = match.groups()
value = int(value)
unit_multipliers = {"K": 1 / 1024, "M": 1, "G": 1024}
mem_mb = value * unit_multipliers.get(unit, 1)
if per_unit and "/node" in per_unit:
# the memory value is per node, hence we need to
# multiply by the number of nodes
return mem_mb * number_of_nodes
return mem_mb # Default case (per CPU or total)
return 0
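A few illustrative calls against these helpers, with hand-picked inputs (all values are made up), showing the unit conversions and the per-node memory handling:

```python
# Example values are invented; they only demonstrate the parsing rules above.
from snakemake_executor_plugin_slurm.efficiency_report import (
    parse_maxrss,
    parse_reqmem,
    time_to_seconds,
)

assert time_to_seconds("05:30") == 330.0          # M:S
assert time_to_seconds("1:02:03.5") == 3723.5     # H:M:S with fractional seconds
assert time_to_seconds("") == 0                   # empty field -> 0

assert parse_maxrss("512K") == 0.5                # KB -> MB
assert parse_maxrss("3G") == 3072                 # GB -> MB

assert parse_reqmem("4000M") == 4000              # total memory request
assert parse_reqmem("2G/node", number_of_nodes=3) == 6144  # per-node request
```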