
Trouble with multiple GPUs: GPU options impose ntasks-per-gpu=1 even when not specified #316


Open

eloualiche opened this issue May 30, 2025 · 5 comments · May be fixed by #318
Labels
bug/fix Something isn't working

Comments

@eloualiche

Preamble

Versions

$ snakemake --version
$ uv tool run --from snakemake python -c "import importlib.metadata; print(f'snakemake-executor-plugin-slurm: {importlib.metadata.version(\"snakemake-executor-plugin-slurm\")}')"
snakemake-executor-plugin-slurm: 1.3.6
$ sinfo --version
slurm 23.11.8

Description

The Slurm executor adds a --ntasks-per-gpu=1 option by default, and I cannot find a way to disable it.
This causes problems for jobs submitted with 2 GPUs.

An easy, non-breaking fix could be to allow a flag value that disables the option. The relevant line currently reads

call += f" --ntasks-per-gpu={job.resources.get('tasks', 1)}"

and could become something like

if gpu_job:
    # allow the user to opt out of --ntasks-per-gpu entirely
    ntasks_per_gpu_val = job.resources.get('ntasks_per_gpu', job.resources.get('tasks', 1))
    if ntasks_per_gpu_val != 0:  # or whichever sentinel value is appropriate to drop the option
        call += f" --ntasks-per-gpu={ntasks_per_gpu_val}"
else:
    call += f" --ntasks={job.resources.get('tasks', 1)}"

This is just a sketch, as I don't know enough about how this plugin has decided to handle flags, etc.

Below are the logs.

The Rule

rule TEST:
    output: "output/JSON/structured_10k_analysis.json"
    params:
        model="mistralai/Mistral-Nemo-Instruct-2407",
        cik=66740,
        daterange=["2000-01-01", "2002-01-01"]
    resources: jobs=1, nodes=1, ntasks=1, tasks=1, cpus_per_gpu=2, mem_mb=64000, tmp=32000, slurm_partition="preempt-gpu,msigpu", gres="gpu:a40:2", runtime=30, slurm_account="eloualic"
    log: "log/TEST_VLLM_10K.log"
    shell: """
    echo "=== Job Started: $(date) ===" &> {log}
    source  {python_venv}/bin/activate  # activate uv python env    
    echo "=== Slurm GPU Allocation Debug ===" &>> {log}
    echo "SLURM_JOB_GPUS: ${{SLURM_JOB_GPUS:-'not set'}}" &>> {log}
    echo "SLURM_STEP_GPUS: ${{SLURM_STEP_GPUS:-'not set'}}" &>> {log}
    echo "SLURM_GPUS_ON_NODE: ${{SLURM_GPUS_ON_NODE:-'not set'}}" &>> {log}
    echo "SLURM_JOB_ID: ${{SLURM_JOB_ID:-'not set'}}" &>> {log}
    echo "SLURM_NODELIST: ${{SLURM_NODELIST:-'not set'}}" &>> {log}

    echo "=== All GPUs on this node ===" &>> {log}
    nvidia-smi -L &>> {log}
    nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv &>> {log}

    # Test 1: Default (what Slurm set)
    echo "Test 1 - Default CUDA_VISIBLE_DEVICES: ${{CUDA_VISIBLE_DEVICES:-'not set'}}" &>> {log}
    python -c "import torch; print(f'Test 1 torch views: {{torch.cuda.device_count()}} GPUs')" &>> {log}
    echo "=== GPU Debug Complete - NOT starting VLLM yet ===" &>> {log}

    echo "=== Starting VLLM with all GPUs: $(date) ===" &>> {log}
    uv run --project {python_project} python -m vllm.entrypoints.openai.api_server --model {params.model} --port 8000 --host 0.0.0.0 --max-model-len 128000 --tensor-parallel-size 2 --gpu-memory-utilization 0.9 &>> {log} 

    """

Snakemake execution

I executed the rule with
snakemake --executor slurm -j1 -R TEST_VLLM_10K --verbose

Log

=== All GPUs on this node ===
=== Slurm GPU Allocation Debug ===
SLURM_JOB_GPUS: 0,2
SLURM_STEP_GPUS: 0
SLURM_GPUS_ON_NODE: 2
SLURM_JOB_ID: 36160287
SLURM_NODELIST: agc03
=== All GPUs on this node ===
GPU 0: NVIDIA A40 (UUID: GPU-8bc7ea13-3b8f-69ea-6322-0c6cb001f22a)
GPU 0: NVIDIA A40 (UUID: GPU-9b70f8e4-5777-df04-dc25-ed7316d3335f)
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0
Test 1 torch views: 1 GPUs
Test 1 torch views: 1 GPUs
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:31:41 CDT 2025 ===
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:31:41 CDT 2025 ===
INFO 05-29 23:31:52 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:31:52 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:31:56 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:31:56 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_method
INFO 05-29 23:31:56 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:31:56 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_method
INFO 05-29 23:32:07 [config.py:717] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 05-29 23:32:07 [config.py:717] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 05-29 23:32:07 [config.py:1770] Defaulting to use ray for distributed inference
INFO 05-29 23:32:07 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-29 23:32:07 [config.py:1770] Defaulting to use ray for distributed inference
INFO 05-29 23:32:07 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral model
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral model
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 05-29 23:32:15 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:32:15 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:32:19 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-24
INFO 05-29 23:32:19 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-24
2025-05-29 23:32:23,970        INFO worker.py:1888 -- Started a local Ray instance.
2025-05-29 23:32:23,979        INFO worker.py:1888 -- Started a local Ray instance.

What seems to happen is that the code runs twice, in two separate instances, each with access to a different GPU (see the UUIDs).
Torch only ever sees one of the GPUs at a time, which means the memory is never pooled.
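
To make the splitting concrete, here is a minimal, untested batch-script sketch (partition and account omitted, gres string copied from my rule) of what I believe --ntasks-per-gpu=1 does with two GPUs: the job gets two tasks, srun launches the command once per task, and each task is bound to a single device, which matches the SLURM_STEP_GPUS: 0 and CUDA_VISIBLE_DEVICES: 0 lines above.

#!/bin/bash
#SBATCH --gres=gpu:a40:2
#SBATCH --ntasks-per-gpu=1
# Each task prints its rank and the devices it can see; with the options above,
# two tasks run and each one reports only a single visible GPU.
srun bash -c 'echo "task ${SLURM_PROCID}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'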

Sbatch execution

I copy-pasted the submission command from snakemake's verbose log, only removing the --ntasks-per-gpu=1 option:

sbatch --parsable --job-name f7d8b643-6865-4907-81d8-da0ca1357747 --output "/scratch.global/eloualic/llm-in-finance/llm_testing/.snakemake/slurm_logs/rule_TEST_VLLM_10K/%j.log" --export=ALL --comment "TEST"  -A 'eloualic'  -p preempt-gpu,msigpu -t 30 --mem 64000 --nodes=1 --cpus-per-gpu=2 -D '/scratch.global/eloualic/llm-in-finance/llm_testing' --gres=gpu:a40:2 --wrap="/home/eloualic/eloualic/.local/uv/tools/snakemake/bin/python -m snakemake --snakefile '/scratch.global/eloualic/llm-in-finance/llm_testing/Snakefile' --target-jobs 'TEST_VLLM_10K:' --allowed-rules TEST_VLLM_10K --cores 'all' --attempt 1 --force-use-threads  --resources 'jobs=1' 'nodes=1' 'ntasks=1' 'tasks=1' 'cpus_per_gpu=2' 'mem_mb=64000' 'mem_mib=61036' 'tmp=32000' --wait-for-files '/scratch.global/eloualic/llm-in-finance/llm_testing/.snakemake/tmp.14gj3wul' 'src/test_ollama_api.jl' 'src/jl_routines/VLLMInterface.jl' 'src/jl_routines/M_PULL10K.jl' --force --target-files-omit-workdir-adjustment --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers params software-env mtime input code --conda-frontend 'conda' --shared-fs-usage sources storage-local-copies software-deployment persistence input-output source-cache --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 5 --scheduler 'greedy' --local-storage-prefix base64//LnNuYWtlbWFrZS9zdG9yYWdl --scheduler-solver-path '/home/eloualic/eloualic/.local/uv/tools/snakemake/bin' --default-resources base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= --executor slurm-jobstep --jobs 1 --mode 'remote'"

The vLLM server started and both GPUs showed up together.

=== Job Started: Thu May 29 23:36:00 CDT 2025 ===
=== Slurm GPU Allocation Debug ===
SLURM_JOB_GPUS: 0,2
SLURM_STEP_GPUS: 0,2
SLURM_GPUS_ON_NODE: 2
SLURM_JOB_ID: 36160324
SLURM_NODELIST: agc03
=== All GPUs on this node ===
GPU 0: NVIDIA A40 (UUID: GPU-9b70f8e4-5777-df04-dc25-ed7316d3335f)
GPU 1: NVIDIA A40 (UUID: GPU-8bc7ea13-3b8f-69ea-6322-0c6cb001f22a)
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
1, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0,1
Test 1 torch views: 2 GPUs
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:36:02 CDT 2025 ===
INFO 05-29 23:36:11 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:14 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:36:14 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=No
INFO 05-29 23:36:26 [config.py:717] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 05-29 23:36:26 [config.py:1770] Defaulting to use mp for distributed inference
INFO 05-29 23:36:26 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral models with `--tokenizer-mode "mistral"` to ens
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 05-29 23:36:34 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:37 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-2407', skip_tokenizer_init=False, tokenizer_
INFO 05-29 23:36:37 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_8d0f238d'), local_subscribe_addr='ipc:///tmp/7ab51e1d-62cb-472c-805e-c9de12556940', remote_su
INFO 05-29 23:36:46 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:46 [__init__.py:239] Automatically detected platform cuda.
WARNING 05-29 23:36:52 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fdaf8d246e0>
WARNING 05-29 23:36:52 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f56f61ab380>
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_361fa885'), local_subscribe_addr='ipc:///tmp/112587c6-3966-4a8c-
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_aeb12a1f'), local_subscribe_addr='ipc:///tmp/10ffba7d-06d5-4373-
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:53 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:53 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:53 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:53 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /users/7/eloualic/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /users/7/eloualic/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_1fc07353'), local_subscribe_addr='ipc:///tmp/3e56dcfa-5ae6-4944-9b
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=3008574) WARNING 05-29 23:36:54 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=3008575) WARNING 05-29 23:36:54 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [gpu_model_runner.py:1329] Starting to load model mistralai/Mistral-Nemo-Instruct-2407...
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [gpu_model_runner.py:1329] Starting to load model mistralai/Mistral-Nemo-Instruct-2407...
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:02,  1.83it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.19it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.10it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.07it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.09it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.13it/s]
(VllmWorker rank=0 pid=3008574)
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:00 [loader.py:458] Loading weights took 4.60 seconds
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:00 [loader.py:458] Loading weights took 4.74 seconds
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:00 [gpu_model_runner.py:1347] Model loading took 11.4384 GiB and 5.338305 seconds
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:00 [gpu_model_runner.py:1347] Model loading took 11.4384 GiB and 5.682200 seconds
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:32 [backends.py:420] Using cache directory: /users/7/eloualic/.cache/vllm/torch_compile_cache/7a51309e4c/rank_0_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:32 [backends.py:420] Using cache directory: /users/7/eloualic/.cache/vllm/torch_compile_cache/7a51309e4c/rank_1_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:32 [backends.py:430] Dynamo bytecode transform time: 31.41 s
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:32 [backends.py:430] Dynamo bytecode transform time: 31.42 s
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:52 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 18.803 s
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:52 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 18.827 s
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:57 [monitor.py:33] torch.compile takes 31.41 s in total
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:57 [monitor.py:33] torch.compile takes 31.42 s in total
INFO 05-29 23:37:59 [kv_cache_utils.py:634] GPU KV cache size: 350,128 tokens
INFO 05-29 23:37:59 [kv_cache_utils.py:637] Maximum concurrency for 128,000 tokens per request: 2.74x
INFO 05-29 23:37:59 [kv_cache_utils.py:634] GPU KV cache size: 350,128 tokens
INFO 05-29 23:37:59 [kv_cache_utils.py:637] Maximum concurrency for 128,000 tokens per request: 2.74x
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:38:54 [custom_all_reduce.py:195] Registering 5427 cuda graph addresses
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:38:54 [custom_all_reduce.py:195] Registering 5427 cuda graph addresses
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:38:54 [gpu_model_runner.py:1686] Graph capturing finished in 55 secs, took 0.63 GiB
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:38:54 [gpu_model_runner.py:1686] Graph capturing finished in 55 secs, took 0.63 GiB
INFO 05-29 23:38:54 [core.py:159] init engine (profile, create kv cache, warmup model) took 113.97 seconds
INFO 05-29 23:38:54 [core_client.py:439] Core engine process 0 ready.
INFO 05-29 23:38:54 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-29 23:38:54 [launcher.py:28] Available routes are:
INFO 05-29 23:38:54 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /health, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /load, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /version, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /score, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [3008420]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
@cmeesters
Member

Thank you for your detailed bug report. I think what happened here is: I tested with an application that uses n GPUs and one task per GPU. It works like a charm - same SLURM version as on your cluster. Here, though, SLURM seems to take the job apart into one task per GPU. I am not sure why this happens.

Your suggested change seems innocent enough. I will have to test it anyhow. That might take some time.

@eloualiche
Author

Yes, I understand. From looking at how the slurm request is constructed, there was no obvious way to add a flag while keeping all your APIs nice and tidy, even though it is just an on/off switch.
I will talk to support on my end (at the Minnesota Supercomputing Institute) to see if there is something that might come from their slurm setup.

At least it would be nice to document the issue in case someone else encounters it.

Thank you for your help.

cmeesters added the bug/fix (Something isn't working) label on Jun 2, 2025
cmeesters linked a pull request on Jun 2, 2025 that will close this issue
@cmeesters
Member

@eloualiche please test the code from PR 318. I would appreciate feedback.

Note: a tasks value <=0 will unset the flag.
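
For illustration, an untested sketch of how a rule might opt out once the PR is merged (the rule name and output path are placeholders; only the tasks=0 setting is the point, the other resource values are copied from the rule in this report):

rule gpu_job:
    output: "results/gpu_check.txt"
    resources:
        slurm_partition="preempt-gpu,msigpu",
        gres="gpu:a40:2",
        cpus_per_gpu=2,
        mem_mb=64000,
        runtime=30,
        tasks=0  # <= 0: the executor should not emit --ntasks-per-gpu
    shell: "nvidia-smi -L > {output}"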

NB: I doubt that you hard-code your workflow configuration in rules outside of reports like this, but we recommend using workflow profiles and keeping the workflow itself as generic as possible.
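
Along those lines, an untested sketch of a workflow profile (e.g. profiles/default/config.yaml next to the Snakefile; the resource values are copied from the rule in this report and should be adjusted for your cluster):

# profiles/default/config.yaml
executor: slurm
jobs: 1
default-resources:
  slurm_account: "eloualic"
  runtime: 30
set-resources:
  TEST:
    slurm_partition: "preempt-gpu,msigpu"
    gres: "gpu:a40:2"
    cpus_per_gpu: 2
    mem_mb: 64000

With such a profile, the rules themselves stay free of cluster-specific resource settings.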

@eloualiche
Author

OK, this worked on my end with the PR, on two different tests.
Merging would be much appreciated!

Thank you so much for getting this done so fast.

@cmeesters
Member

Right now, we have a big documentation PR pending. I very much prefer to get that done first, so that I do not need to merge this and then work through piles of text for every add-on. It should be done by the end of the week.
