
Trouble with multiple GPUs: GPU options impose ntasks-per-gpu=1 even when not specified #316


Open

eloualiche opened this issue May 30, 2025 · 5 comments · May be fixed by #318
Labels
bug/fix Something isn't working

Comments

@eloualiche

Preamble

Versions

$ snakemake --version
$ uv tool run --from snakemake python -c "import importlib.metadata; print(f'snakemake-executor-plugin-slurm: {importlib.metadata.version(\"snakemake-executor-plugin-slurm\")}')"
snakemake-executor-plugin-slurm: 1.3.6
$ sinfo --version
slurm 23.11.8

Description

The Slurm executor adds a --ntasks-per-gpu=1 option by default, and I cannot find a way to disable it.
This causes problems for jobs submitted with 2 GPUs.

An easy, non-breaking fix could be to allow a flag value that disables the option. The relevant line currently reads

call += f" --ntasks-per-gpu={job.resources.get('tasks', 1)}"

and could become something like

if gpu_job:
    # allow the user to opt out of --ntasks-per-gpu entirely
    ntasks_per_gpu_val = job.resources.get('ntasks_per_gpu', job.resources.get('tasks', 1))
    if ntasks_per_gpu_val != 0:  # or whichever sentinel value is appropriate to drop the option
        call += f" --ntasks-per-gpu={ntasks_per_gpu_val}"
else:
    call += f" --ntasks={job.resources.get('tasks', 1)}"

This is just a sketch, as I don't know enough about how this plugin has decided to handle flags, etc.

Below are the logs.

The Rule

rule TEST:
    output: "output/JSON/structured_10k_analysis.json"
    params:
        model="mistralai/Mistral-Nemo-Instruct-2407",
        cik=66740,
        daterange=["2000-01-01", "2002-01-01"]
    resources: jobs=1, nodes=1, ntasks=1, tasks=1, cpus_per_gpu=2, mem_mb=64000, tmp=32000, slurm_partition="preempt-gpu,msigpu", gres="gpu:a40:2", runtime=30, slurm_account="eloualic"
    log: "log/TEST_VLLM_10K.log"
    shell: """
    echo "=== Job Started: $(date) ===" &> {log}
    source  {python_venv}/bin/activate  # activate uv python env    
    echo "=== Slurm GPU Allocation Debug ===" &>> {log}
    echo "SLURM_JOB_GPUS: ${{SLURM_JOB_GPUS:-'not set'}}" &>> {log}
    echo "SLURM_STEP_GPUS: ${{SLURM_STEP_GPUS:-'not set'}}" &>> {log}
    echo "SLURM_GPUS_ON_NODE: ${{SLURM_GPUS_ON_NODE:-'not set'}}" &>> {log}
    echo "SLURM_JOB_ID: ${{SLURM_JOB_ID:-'not set'}}" &>> {log}
    echo "SLURM_NODELIST: ${{SLURM_NODELIST:-'not set'}}" &>> {log}

    echo "=== All GPUs on this node ===" &>> {log}
    nvidia-smi -L &>> {log}
    nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv &>> {log}

    # Test 1: Default (what Slurm set)
    echo "Test 1 - Default CUDA_VISIBLE_DEVICES: ${{CUDA_VISIBLE_DEVICES:-'not set'}}" &>> {log}
    python -c "import torch; print(f'Test 1 torch views: {{torch.cuda.device_count()}} GPUs')" &>> {log}
    echo "=== GPU Debug Complete - NOT starting VLLM yet ===" &>> {log}

    echo "=== Starting VLLM with all GPUs: $(date) ===" &>> {log}
    uv run --project {python_project} python -m vllm.entrypoints.openai.api_server --model {params.model} --port 8000 --host 0.0.0.0 --max-model-len 128000 --tensor-parallel-size 2 --gpu-memory-utilization 0.9 &>> {log} 

    """

Snakemake execution

I executed the rule with
snakemake --executor slurm -j1 -R TEST_VLLM_10K --verbose

Log

=== All GPUs on this node ===
=== Slurm GPU Allocation Debug ===
SLURM_JOB_GPUS: 0,2
SLURM_STEP_GPUS: 0
SLURM_GPUS_ON_NODE: 2
SLURM_JOB_ID: 36160287
SLURM_NODELIST: agc03
=== All GPUs on this node ===
GPU 0: NVIDIA A40 (UUID: GPU-8bc7ea13-3b8f-69ea-6322-0c6cb001f22a)
GPU 0: NVIDIA A40 (UUID: GPU-9b70f8e4-5777-df04-dc25-ed7316d3335f)
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0
Test 1 torch views: 1 GPUs
Test 1 torch views: 1 GPUs
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:31:41 CDT 2025 ===
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:31:41 CDT 2025 ===
INFO 05-29 23:31:52 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:31:52 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:31:56 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:31:56 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_method
INFO 05-29 23:31:56 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:31:56 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_method
INFO 05-29 23:32:07 [config.py:717] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 05-29 23:32:07 [config.py:717] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 05-29 23:32:07 [config.py:1770] Defaulting to use ray for distributed inference
INFO 05-29 23:32:07 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-29 23:32:07 [config.py:1770] Defaulting to use ray for distributed inference
INFO 05-29 23:32:07 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral model
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral model
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 05-29 23:32:15 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:32:15 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:32:19 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-24
INFO 05-29 23:32:19 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-24
2025-05-29 23:32:23,970        INFO worker.py:1888 -- Started a local Ray instance.
2025-05-29 23:32:23,979        INFO worker.py:1888 -- Started a local Ray instance.

What seems to happen is that the code runs twice, in two separate instances, each with access to a different GPU (see the UUIDs).
Torch only ever sees one of the GPUs at a time, which means the memory is never pooled.
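
To make the splitting concrete, here is a minimal, untested batch-script sketch (partition and account omitted, gres string copied from my rule) of what I believe --ntasks-per-gpu=1 does with two GPUs: the job gets two tasks, srun launches the command once per task, and each task is bound to a single device, which matches the SLURM_STEP_GPUS: 0 and CUDA_VISIBLE_DEVICES: 0 lines above.

#!/bin/bash
#SBATCH --gres=gpu:a40:2
#SBATCH --ntasks-per-gpu=1
# Each task prints its rank and the devices it can see; with the options above,
# two tasks run and each one reports only a single visible GPU.
srun bash -c 'echo "task ${SLURM_PROCID}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'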

Sbatch execution

I copy-pasted the submission command from snakemake's verbose log, only removing the --ntasks-per-gpu=1 option:

sbatch --parsable --job-name f7d8b643-6865-4907-81d8-da0ca1357747 --output "/scratch.global/eloualic/llm-in-finance/llm_testing/.snakemake/slurm_logs/rule_TEST_VLLM_10K/%j.log" --export=ALL --comment "TEST"  -A 'eloualic'  -p preempt-gpu,msigpu -t 30 --mem 64000 --nodes=1 --cpus-per-gpu=2 -D '/scratch.global/eloualic/llm-in-finance/llm_testing' --gres=gpu:a40:2 --wrap="/home/eloualic/eloualic/.local/uv/tools/snakemake/bin/python -m snakemake --snakefile '/scratch.global/eloualic/llm-in-finance/llm_testing/Snakefile' --target-jobs 'TEST_VLLM_10K:' --allowed-rules TEST_VLLM_10K --cores 'all' --attempt 1 --force-use-threads  --resources 'jobs=1' 'nodes=1' 'ntasks=1' 'tasks=1' 'cpus_per_gpu=2' 'mem_mb=64000' 'mem_mib=61036' 'tmp=32000' --wait-for-files '/scratch.global/eloualic/llm-in-finance/llm_testing/.snakemake/tmp.14gj3wul' 'src/test_ollama_api.jl' 'src/jl_routines/VLLMInterface.jl' 'src/jl_routines/M_PULL10K.jl' --force --target-files-omit-workdir-adjustment --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers params software-env mtime input code --conda-frontend 'conda' --shared-fs-usage sources storage-local-copies software-deployment persistence input-output source-cache --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 5 --scheduler 'greedy' --local-storage-prefix base64//LnNuYWtlbWFrZS9zdG9yYWdl --scheduler-solver-path '/home/eloualic/eloualic/.local/uv/tools/snakemake/bin' --default-resources base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= --executor slurm-jobstep --jobs 1 --mode 'remote'"

The vLLM server started and both GPUs showed up together.

=== Job Started: Thu May 29 23:36:00 CDT 2025 ===
=== Slurm GPU Allocation Debug ===
SLURM_JOB_GPUS: 0,2
SLURM_STEP_GPUS: 0,2
SLURM_GPUS_ON_NODE: 2
SLURM_JOB_ID: 36160324
SLURM_NODELIST: agc03
=== All GPUs on this node ===
GPU 0: NVIDIA A40 (UUID: GPU-9b70f8e4-5777-df04-dc25-ed7316d3335f)
GPU 1: NVIDIA A40 (UUID: GPU-8bc7ea13-3b8f-69ea-6322-0c6cb001f22a)
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
1, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0,1
Test 1 torch views: 2 GPUs
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:36:02 CDT 2025 ===
INFO 05-29 23:36:11 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:14 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:36:14 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=No
INFO 05-29 23:36:26 [config.py:717] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 05-29 23:36:26 [config.py:1770] Defaulting to use mp for distributed inference
INFO 05-29 23:36:26 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral models with `--tokenizer-mode "mistral"` to ens
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 05-29 23:36:34 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:37 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-2407', skip_tokenizer_init=False, tokenizer_
INFO 05-29 23:36:37 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_8d0f238d'), local_subscribe_addr='ipc:///tmp/7ab51e1d-62cb-472c-805e-c9de12556940', remote_su
INFO 05-29 23:36:46 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:46 [__init__.py:239] Automatically detected platform cuda.
WARNING 05-29 23:36:52 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fdaf8d246e0>
WARNING 05-29 23:36:52 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f56f61ab380>
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_361fa885'), local_subscribe_addr='ipc:///tmp/112587c6-3966-4a8c-
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_aeb12a1f'), local_subscribe_addr='ipc:///tmp/10ffba7d-06d5-4373-
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:53 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:53 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:53 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:53 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /users/7/eloualic/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /users/7/eloualic/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_1fc07353'), local_subscribe_addr='ipc:///tmp/3e56dcfa-5ae6-4944-9b
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=3008574) WARNING 05-29 23:36:54 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=3008575) WARNING 05-29 23:36:54 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [gpu_model_runner.py:1329] Starting to load model mistralai/Mistral-Nemo-Instruct-2407...
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [gpu_model_runner.py:1329] Starting to load model mistralai/Mistral-Nemo-Instruct-2407...
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:02,  1.83it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.19it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.10it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.07it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.09it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.13it/s]
(VllmWorker rank=0 pid=3008574)
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:00 [loader.py:458] Loading weights took 4.60 seconds
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:00 [loader.py:458] Loading weights took 4.74 seconds
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:00 [gpu_model_runner.py:1347] Model loading took 11.4384 GiB and 5.338305 seconds
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:00 [gpu_model_runner.py:1347] Model loading took 11.4384 GiB and 5.682200 seconds
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:32 [backends.py:420] Using cache directory: /users/7/eloualic/.cache/vllm/torch_compile_cache/7a51309e4c/rank_0_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:32 [backends.py:420] Using cache directory: /users/7/eloualic/.cache/vllm/torch_compile_cache/7a51309e4c/rank_1_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:32 [backends.py:430] Dynamo bytecode transform time: 31.41 s
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:32 [backends.py:430] Dynamo bytecode transform time: 31.42 s
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:52 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 18.803 s
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:52 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 18.827 s
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:57 [monitor.py:33] torch.compile takes 31.41 s in total
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:57 [monitor.py:33] torch.compile takes 31.42 s in total
INFO 05-29 23:37:59 [kv_cache_utils.py:634] GPU KV cache size: 350,128 tokens
INFO 05-29 23:37:59 [kv_cache_utils.py:637] Maximum concurrency for 128,000 tokens per request: 2.74x
INFO 05-29 23:37:59 [kv_cache_utils.py:634] GPU KV cache size: 350,128 tokens
INFO 05-29 23:37:59 [kv_cache_utils.py:637] Maximum concurrency for 128,000 tokens per request: 2.74x
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:38:54 [custom_all_reduce.py:195] Registering 5427 cuda graph addresses
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:38:54 [custom_all_reduce.py:195] Registering 5427 cuda graph addresses
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:38:54 [gpu_model_runner.py:1686] Graph capturing finished in 55 secs, took 0.63 GiB
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:38:54 [gpu_model_runner.py:1686] Graph capturing finished in 55 secs, took 0.63 GiB
INFO 05-29 23:38:54 [core.py:159] init engine (profile, create kv cache, warmup model) took 113.97 seconds
INFO 05-29 23:38:54 [core_client.py:439] Core engine process 0 ready.
INFO 05-29 23:38:54 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-29 23:38:54 [launcher.py:28] Available routes are:
INFO 05-29 23:38:54 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /health, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /load, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /version, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /score, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [3008420]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
@cmeesters
Member

Thank you for your detailed bug report. I think what happened here is: I tested with an application that uses n GPUs and one task per GPU. It works like a charm - same SLURM version as on your cluster. Here, though, SLURM seems to take the job apart into one task per GPU. I am not sure why this happens.

Your suggested change seems innocent enough. I will have to test it anyhow. That might take some time.

@eloualiche
Author

Yes, I understand. From looking at how the slurm request is constructed, there was no obvious way to add a flag while keeping all your APIs nice and tidy, even though it is just an on/off switch.
I will talk to support on my end (at the Minnesota Supercomputing Institute) to see if there is something that might come from their slurm setup.

At least it would be nice to document the issue in case someone else encounters it.

Thank you for your help.

cmeesters added the bug/fix (Something isn't working) label on Jun 2, 2025
cmeesters linked a pull request on Jun 2, 2025 that will close this issue
@cmeesters
Member

@eloualiche please test the code from PR 318. I would appreciate feedback.

Note: a tasks value <=0 will unset the flag.
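
For illustration, an untested sketch of how a rule might opt out once the PR is merged (the rule name and output path are placeholders; only the tasks=0 setting is the point, the other resource values are copied from the rule in this report):

rule gpu_job:
    output: "results/gpu_check.txt"
    resources:
        slurm_partition="preempt-gpu,msigpu",
        gres="gpu:a40:2",
        cpus_per_gpu=2,
        mem_mb=64000,
        runtime=30,
        tasks=0  # <= 0: the executor should not emit --ntasks-per-gpu
    shell: "nvidia-smi -L > {output}"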

NB: I doubt that you hard-code your workflow configuration in rules outside of reports like this, but we recommend using workflow profiles and keeping the workflow itself as generic as possible.
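
Along those lines, an untested sketch of a workflow profile (e.g. profiles/default/config.yaml next to the Snakefile; the resource values are copied from the rule in this report and should be adjusted for your cluster):

# profiles/default/config.yaml
executor: slurm
jobs: 1
default-resources:
  slurm_account: "eloualic"
  runtime: 30
set-resources:
  TEST:
    slurm_partition: "preempt-gpu,msigpu"
    gres: "gpu:a40:2"
    cpus_per_gpu: 2
    mem_mb: 64000

With such a profile, the rules themselves stay free of cluster-specific resource settings.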

@eloualiche
Author

OK, this worked on my end with the PR, on two different tests.
Merging would be much appreciated!

Thank you so much for getting this done so fast.

@cmeesters
Member

Right now, we have a big documentation PR pending. I very much prefer to get that done first, so that I do not need to merge this and then work through piles of text for every add-on. It should be done by the end of the week.
