Trouble with multiple GPUs: GPU options impose ntasks-per-gpu=1 even when not specified #316
Thank you for your detailed bug report. I think what happened here is: I tested with an application that uses n GPUs and one task per GPU. It works like a charm, with the same SLURM version as on your cluster. Here, though, SLURM seems to take the job apart with 1 task per GPU, and I am not sure why this happens. Your suggested change seems innocent enough. I will have to test it anyhow; that might take some time.
Yes, I understand. From looking at how the SLURM request is constructed, there was no obvious way to add a flag while keeping all your APIs nice and tidy, even though it is just an on/off switch. At the very least it would be nice to document the issue in case someone else encounters it. Thank you for your help.
@eloualiche please test the code from PR 318. I would appreciate feedback. NB: I doubt that you hardcode your workflow configuration in rules other than for reports like this one, but we recommend using workflow profiles and keeping the workflow itself as generic as possible.
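For readers unfamiliar with that recommendation: resources can be moved out of the rules and into a workflow profile. The snippet below is only a sketch with made-up values (the rule name is taken from this report; the resource keys should be checked against the Snakemake and plugin documentation for your versions):

```yaml
# config.yaml of a workflow profile (hypothetical example)
executor: slurm
jobs: 1
default-resources:
  slurm_partition: "gpu"   # assumed partition name
set-resources:
  TEST_VLLM_10K:
    gpus: 2                # request both GPUs for this rule
```

This keeps the Snakefile free of cluster-specific settings, so the same workflow runs on other machines unchanged.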
Ok. This worked on my end with the PR on two different tests. Thank you so much for getting this done so fast. |
Right now we have a big documentation PR pending. I would very much prefer to get that done first, so that I do not have to merge this and then work through piles of text for every add-on. It should be done by the end of the week.
Preamble
Versions
```shell
$ snakemake --version
$ uv tool run --from snakemake python -c "import importlib.metadata; print(f'snakemake-executor-plugin-slurm: {importlib.metadata.version(\"snakemake-executor-plugin-slurm\")}')"
snakemake-executor-plugin-slurm: 1.3.6
$ sinfo --version
slurm 23.11.8
```
Description
The Slurm executor adds an `ntasks-per-gpu=1` option by default, and I cannot find a way to disable it. This leads to issues on jobs submitted with 2 GPUs.
An easy (non-breaking) fix could be to allow a flag value that disables the option:
snakemake-executor-plugin-slurm/snakemake_executor_plugin_slurm/submit_string.py
Line 57 in ec82a70
This is just a sketch, as I don't know enough about how this plugin has decided to handle flags.
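To make the sketch concrete: the idea is to treat a sentinel value (e.g. `None` or `0`) as "do not emit the flag at all". The function and parameter names below are hypothetical, not the plugin's actual API:

```python
def gpu_task_options(ntasks_per_gpu=1):
    """Build the per-GPU task arguments for the sbatch call.

    A falsy sentinel (None or 0) suppresses the flag entirely,
    so users can opt out without a breaking API change.
    (Hypothetical helper, not the plugin's real code.)
    """
    if not ntasks_per_gpu:
        return []  # flag disabled: let SLURM apply its own defaults
    return [f"--ntasks-per-gpu={ntasks_per_gpu}"]

# Default behaviour keeps the current flag:
assert gpu_task_options() == ["--ntasks-per-gpu=1"]
# An explicit sentinel drops it:
assert gpu_task_options(None) == []
```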
Below are the logs.
The Rule
Snakemake execution
I executed the rule with
snakemake --executor slurm -j1 -R TEST_VLLM_10K --verbose
Log
What seems to happen is that this runs the code twice, on two instances, each having access to a different GPU (see the UUIDs). Torch only ever sees one of the GPUs at a time, which means it never pools memory.
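One way to confirm this split from inside the job is to check which devices each task was handed via `CUDA_VISIBLE_DEVICES`. A small diagnostic sketch (not part of the rule, and no GPU needed to illustrate the logic):

```python
import os

def visible_gpus(env=None):
    """Return the list of GPU indices a process can see
    according to its CUDA_VISIBLE_DEVICES environment variable."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [d for d in raw.split(",") if d]

# With ntasks-per-gpu=1, each of the two tasks sees a single device:
assert visible_gpus({"CUDA_VISIBLE_DEVICES": "0"}) == ["0"]
# In a single task holding both GPUs, torch could pool their memory:
assert visible_gpus({"CUDA_VISIBLE_DEVICES": "0,1"}) == ["0", "1"]
```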
Srun execution
I copy-pasted the rule from the verbose log of snakemake, only removing the `ntasks-per-gpu=1` option. The vllm server started and the GPUs did show up together.