test: Limiting multi-gpu tests to use Ray as distributed_executor_backend #47


Merged
@oandreeva-nv merged 9 commits from oandreeva_vllm_0.5.2 into main on Jul 25, 2024

Conversation

oandreeva-nv
Contributor

@oandreeva-nv commented Jul 17, 2024

In PR #5230 vllm changed the default executor for distributed serving from Ray to python native multiprocessing for single-node processing. This becomes an issue for Triton starting with the v0.5.1 release.
For python native multiprocessing mode and the KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired. Exiting immediately." and pt_main_thread processes are never stopped/killed. I'll create an issue a bit later.

Solution: support only Ray for deploying models with tensor_parallel_size > 1, via the "distributed_executor_backend" flag, until the issue is fixed.

This PR adjusts our multi-gpu tests according to the observations above.
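For context, a minimal sketch of the kind of model.json change this implies for a multi-GPU model, written the way the test scripts assemble it; the model name, tensor_parallel_size value, and output path below are illustrative placeholders, not the exact test configuration:

# Illustrative only: pin the Ray executor for a model served with tensor parallelism.
model_json=$(cat <<EOF
{
    "model": "facebook/opt-125m",
    "tensor_parallel_size": 2,
    "distributed_executor_backend": "ray"
}
EOF
)
# The Triton vLLM backend reads model.json from the model's version directory;
# the repository layout here is a placeholder.
echo "$model_json" > models/vllm_model/1/model.json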

@oandreeva-nv marked this pull request as ready for review July 17, 2024 23:13
@oandreeva-nv changed the title from "Oandreeva vllm 0.5.2" to "Limiting multi-gpu tests to use Ray as distributed_executor_backend" on Jul 17, 2024
@rmccorm4 changed the title from "Limiting multi-gpu tests to use Ray as distributed_executor_backend" to "test: Limiting multi-gpu tests to use Ray as distributed_executor_backend" on Jul 17, 2024
@@ -62,7 +62,8 @@ model_json=$(cat <<EOF
     "enforce_eager": "true",
     "enable_lora": "true",
     "max_lora_rank": 32,
-    "lora_extra_vocab_size": 256
+    "lora_extra_vocab_size": 256,
+    "distributed_executor_backend":"ray"
rmccorm4
Contributor

@rmccorm4 commented Jul 18, 2024

For python native multiprocessing mode and KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired.

A few questions on this for my own understanding moving forward:

  1. Do we know more details or limitations on why this is happening?
  2. Is this an error happening on server shutdown?
    • Is there some issue with the python native multiprocessing due to the details of Triton's python backend launching each instance as a separate process?
  3. Is this with 1, 2, or any amount of model instances with KIND_MODEL?

oandreeva-nv
Contributor Author

Do we know more details or limitations on why this is happening?

The issue is an unclear multi-gpu test failure when upgrading to vLLM 0.5.0 and later.
In PR #5230 vllm changed the default executor for distributed serving from Ray to python native multiprocessing for single-node processing. This becomes an issue starting with the v0.5.1 release.
For python native multiprocessing mode and the KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired. Exiting immediately." and pt_main_thread processes are never stopped/killed.
Solution: add "distributed_executor_backend": "ray" to model.json.

Is this an error happening on server shutdown?

Yes, and I have a reproducer outside of Triton.

Is this with 1, 2, or any amount of model instances with KIND_MODEL?

If "distributed_executor_backend" field is not specified, than for tp>2 and distributed among a single node, than MP backend kicks in. However, I've noticed that even when tp=1 and "distributed_executor_backend" is specified in model.json, vllm will go through distributed serving even when tp=1. More on the slack channel for this behavior

@rcarrata

@oandreeva-nv afaik you can set --distributed-executor-backend to ray and avoid the use of MP.

From the docs of distributed serving:
Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured tensor_parallel_size, otherwise Ray will be used. This default can be overridden via the LLM class distributed-executor-backend argument or --distributed-executor-backend API server argument. Set it to mp for multiprocessing or ray for Ray. It’s not required for Ray to be installed for the multiprocessing case.
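For illustration, a hedged example of the override the docs describe, applied to vLLM's OpenAI-compatible API server; the model name and tensor-parallel size are placeholders, and this command is not part of this PR:

# Illustrative only: force Ray instead of the default multiprocessing executor
# when serving with tensor parallelism through vLLM's OpenAI-compatible server.
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray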

@oandreeva-nv
Contributor Author

@rcarrata That's what I'm doing in this PR: I'm making sure that ray is used for distributed testing. Or did I misunderstand your comment?

@oandreeva-nv merged commit 05c5a8b into main on Jul 25, 2024
3 checks passed
@oandreeva-nv deleted the oandreeva_vllm_0.5.2 branch on July 25, 2024 22:32