test: Limiting multi-gpu tests to use Ray as distributed_executor_backend #47


Merged
@oandreeva-nv merged 9 commits from oandreeva_vllm_0.5.2 into main on Jul 25, 2024

Conversation

oandreeva-nv
Contributor

@oandreeva-nv commented Jul 17, 2024

In PR #5230 vllm changed the default executor for distributed serving from Ray to python native multiprocessing for single-node processing. This becomes an issue for Triton starting with the v0.5.1 release.
For python native multiprocessing mode and the KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired. Exiting immediately." and pt_main_thread processes are never stopped/killed. I'll create an issue a bit later.

Solution: support only Ray for deploying models with tensor_parallel_size > 1, via the "distributed_executor_backend" flag, until the issue is fixed.

This PR adjusts our multi-gpu tests according to the observations above.
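For context, a minimal sketch of the kind of model.json change this implies for a multi-GPU model, written the way the test scripts assemble it; the model name, tensor_parallel_size value, and output path below are illustrative placeholders, not the exact test configuration:

# Illustrative only: pin the Ray executor for a model served with tensor parallelism.
model_json=$(cat <<EOF
{
    "model": "facebook/opt-125m",
    "tensor_parallel_size": 2,
    "distributed_executor_backend": "ray"
}
EOF
)
# The Triton vLLM backend reads model.json from the model's version directory;
# the repository layout here is a placeholder.
echo "$model_json" > models/vllm_model/1/model.json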

@oandreeva-nv marked this pull request as ready for review July 17, 2024 23:13
@oandreeva-nv changed the title from "Oandreeva vllm 0.5.2" to "Limiting multi-gpu tests to use Ray as distributed_executor_backend" on Jul 17, 2024
@rmccorm4 changed the title from "Limiting multi-gpu tests to use Ray as distributed_executor_backend" to "test: Limiting multi-gpu tests to use Ray as distributed_executor_backend" on Jul 17, 2024
@@ -62,7 +62,8 @@ model_json=$(cat <<EOF
     "enforce_eager": "true",
     "enable_lora": "true",
     "max_lora_rank": 32,
-    "lora_extra_vocab_size": 256
+    "lora_extra_vocab_size": 256,
+    "distributed_executor_backend":"ray"
rmccorm4
Contributor

@rmccorm4 commented Jul 18, 2024

For python native multiprocessing mode and KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired.

A few questions on this for my own understanding moving forward:

  1. Do we know more details or limitations on why this is happening?
  2. Is this an error happening on server shutdown?
    • Is there some issue with the python native multiprocessing due to the details of Triton's python backend launching each instance as a separate process?
  3. Is this with 1, 2, or any amount of model instances with KIND_MODEL?

oandreeva-nv
Contributor Author

Do we know more details or limitations on why this is happening?

The issue is an unclear multi-gpu test failure when upgrading to vLLM 0.5.0 and later.
In PR #5230 vllm changed the default executor for distributed serving from Ray to python native multiprocessing for single-node processing. This becomes an issue starting with the v0.5.1 release.
For python native multiprocessing mode and the KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired. Exiting immediately." and pt_main_thread processes are never stopped/killed.
Solution: add "distributed_executor_backend": "ray" to model.json.

Is this an error happening on server shutdown?

Yes, and I have a reproducer outside of Triton.

Is this with 1, 2, or any amount of model instances with KIND_MODEL?

If "distributed_executor_backend" field is not specified, than for tp>2 and distributed among a single node, than MP backend kicks in. However, I've noticed that even when tp=1 and "distributed_executor_backend" is specified in model.json, vllm will go through distributed serving even when tp=1. More on the slack channel for this behavior

@rcarrata

@oandreeva-nv afaik you can set --distributed-executor-backend to ray and avoid the use of MP.

From the docs of distributed serving:
Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured tensor_parallel_size, otherwise Ray will be used. This default can be overridden via the LLM class distributed-executor-backend argument or --distributed-executor-backend API server argument. Set it to mp for multiprocessing or ray for Ray. It’s not required for Ray to be installed for the multiprocessing case.
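For illustration, a hedged example of the override the docs describe, applied to vLLM's OpenAI-compatible API server; the model name and tensor-parallel size are placeholders, and this command is not part of this PR:

# Illustrative only: force Ray instead of the default multiprocessing executor
# when serving with tensor parallelism through vLLM's OpenAI-compatible server.
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray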

@oandreeva-nv
Contributor Author

@rcarrata That's what I'm doing in this PR: I'm making sure that ray is used for distributed testing. Or did I misunderstand your comment?

@oandreeva-nv merged commit 05c5a8b into main on Jul 25, 2024
3 checks passed
@oandreeva-nv deleted the oandreeva_vllm_0.5.2 branch on July 25, 2024 22:32