[rollout] feat: introduce vLLM AsyncLLM to support multi-turn rollout #1138
Conversation
Some comments:
The ideal case is that the construction of the vLLM worker and the vLLM single controller can be separated: we keep a handle to each worker so that they are built in the training process, and then point those handles to the vLLM single controller. The vLLM single controller can be colocated with the RL single controller.
In the architecture above, AsyncLLMWorker is a CPU-only actor which contains only the Scheduler and Executor; the GPU model runner is still colocated with the FSDP model in the same process. It's very complicated to colocate AsyncLLMWorker into the single controller or the training process, since AsyncLLM's architecture is a combination of multiprocessing and an asyncio busy loop; see vllm-project/vllm#9826
Fantastic! I guess this is exactly what I expected!
This is soooo amazing, I didn't imagine you could reuse existing Ray actors and let them be controlled by the vLLM executor 👍
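As an aside, here is a minimal sketch of the idea being praised here: a custom executor backend that reuses already-created Ray actors by looking them up by name, which is what the PR summary says ExternalRayDistributedExecutor does. The class, method, and actor names below are illustrative, not verl's actual implementation.

```python
# Minimal sketch: a custom executor backend that reuses existing Ray actors by
# looking them up by name instead of spawning its own GPU workers. Class and
# actor names here are illustrative, not verl's ExternalRayDistributedExecutor.
from typing import Any, List

import ray


class NameBasedExecutorSketch:
    def __init__(self, actor_names: List[str]):
        # Grab handles to workers created elsewhere (e.g. by the training
        # worker group); no new processes or GPUs are allocated here.
        self.workers = [ray.get_actor(name) for name in actor_names]

    def collective_rpc(self, method: str, *args: Any, **kwargs: Any) -> List[Any]:
        # Fan the call out to every worker and wait for all results.
        refs = [getattr(worker, method).remote(*args, **kwargs) for worker in self.workers]
        return ray.get(refs)


# Hypothetical usage with actor names like the ones seen later in this thread:
# executor = NameBasedExecutorSketch(["svWrqkWorkerDict_0:0", "svWrqkWorkerDict_0:1"])
# executor.collective_rpc("init_worker", all_kwargs)
```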
I was asked to review here based on #899 and https://github.com/casper-hansen/verl/tree/feature/interleaved-tool-calling. EDIT: I just read your ReTool paper. I am trying to achieve exactly the same thing, just with search rather than an interpreter. Let me know if you need more clarification on my questions; my main concern is the data needed to achieve this behavior. My understanding is that the new async rollout is a great abstraction. It may achieve the 3x higher throughput described in the Seed-Thinking-v1.5 paper if you decouple model evolution from runtime execution, but at least in the current implementation I see less flexibility for specific inference methods. So I have some clarifying questions. Questions:
@casper-hansen For Question 1, I think we can achieve interleaved function calling by multi-turn rollouts:
We can achieve this in the chat_completion callback. Alternatively, maybe we can call tools on the server side (see vLLM tool_calling), but I haven't investigated it yet.
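A rough sketch of that callback-driven flow is shown below. It assumes a callback with the signature shown later in this thread; `run_tool` and `scheduler.submit_chat_completions` are hypothetical placeholders, not verl's actual API.

```python
# Rough sketch of interleaved tool calling via a chat_completion callback.
# The callback signature follows the one shown later in this thread; `run_tool`
# and `scheduler.submit_chat_completions` are hypothetical placeholders.
from typing import Any, Dict

from openai.types.chat import ChatCompletion


async def run_tool(tool_call) -> str:
    # Hypothetical tool executor (code interpreter, search, ...).
    return f"[output of {tool_call.function.name}]"


async def tool_callback(completions: ChatCompletion, info: Dict[str, Any], exception: Exception):
    assert exception is None, f"chat completion failed: {exception}"
    messages, scheduler = info["messages"], info["scheduler"]
    choice = completions.choices[0]
    messages.append(choice.message.model_dump(exclude_none=True))

    if choice.finish_reason == "tool_calls":
        # Turn N: the model asked for a tool; execute it and append the result.
        for tool_call in choice.message.tool_calls:
            result = await run_tool(tool_call)
            messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
        # Turn N+1: resubmit the extended conversation to continue the rollout.
        await scheduler.submit_chat_completions(
            callback=tool_callback, callback_additional_info=info, messages=messages
        )
    # Otherwise this rollout is finished; the final messages are postprocessed later.
```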
This sounds feasible. I just looked at the naive scheduler and I like the design. I can already see how to port my work on interleaved tool calling, so I do expect to bring that to both the sync and async implementations. @wuxibin89 I do have a general request for this PR: it would really help me test this more easily if you could create the following. The reason I ask is that veRL can sometimes take 5-10 minutes to initialize.
@casper-hansen I added a small multi-turn rollout test case; you can run it with:
python3 tests/rollout/test_vllm_multi_turn.py
If everything is OK, you should see model outputs like the ones below:
Looks good to me! Let's get this merged.
I switched to your branch and kicked off a run with the naive reward_manager, but it ended with an error: ValueError: Failed to look up actor with name 'svWrqkWorkerDict_0:0'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor. Any quick fix?
@KawaiiNotHawaii Maybe the actor
Noob question: will we have query locality to make use of prefix caching?
Yes, the ChatScheduler sticks a multi-turn rollout session to a specific server by
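For context, the PR summary lists "least requests load balance" and "sticky session with prefix caching" for the ChatScheduler. Below is a minimal illustrative sketch of such a routing policy, not verl's actual code.

```python
# Illustrative sketch of least-requests load balancing plus sticky sessions:
# every turn of a multi-turn rollout is routed back to the same server, so the
# earlier turns stay in that server's vLLM prefix cache. Not verl's actual code.
from collections import defaultdict
from typing import Dict, List


class StickyLeastRequestsRouter:
    def __init__(self, server_addresses: List[str]):
        self.servers = list(server_addresses)
        self.inflight: Dict[str, int] = defaultdict(int)  # address -> in-flight requests
        self.sessions: Dict[str, str] = {}                # rollout/session id -> pinned address

    def pick_server(self, session_id: str) -> str:
        # Sticky session: follow-up turns reuse the server chosen on turn 1.
        if session_id in self.sessions:
            return self.sessions[session_id]
        # First turn: choose the server with the fewest in-flight requests.
        address = min(self.servers, key=lambda addr: self.inflight[addr])
        self.sessions[session_id] = address
        return address

    def request_started(self, address: str) -> None:
        self.inflight[address] += 1

    def request_finished(self, address: str) -> None:
        self.inflight[address] -= 1
```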
Hi @wuxibin89, this is awesome to see; let me know if you need any support from the Ray side for this PR.
EDIT: I realise this is a TODO. @wuxibin89 I took a look again this morning and I am wondering how we can handle the attention mask. For example, when calling a tool, the information the tool returns should be masked as zeros in the attention mask for improved convergence.
It would be awesome if we could write a quickstart on how to set up a model for interleaved code execution (perhaps in a dummy environment) - this is the future!
@casper-hansen We should postprocess the completions and convert them to DataProto; in that postprocessing we can mask out the tool-calling tokens.
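A rough sketch of what that postprocessing mask could look like, assuming the conversation is kept as role-tagged messages; the tokenizer usage and DataProto conversion details are assumptions, not verl's actual code.

```python
# Rough sketch: when converting a finished multi-turn conversation into tensors,
# keep mask 1 for assistant-generated tokens and mask 0 for tool-returned tokens
# so they are excluded from the loss/attention, as discussed above. The message
# layout and tokenizer usage are assumptions, not verl's DataProto conversion.
from typing import Dict, List

import torch
from transformers import AutoTokenizer


def build_response_and_mask(messages: List[Dict[str, str]], tokenizer) -> Dict[str, torch.Tensor]:
    token_ids: List[int] = []
    mask: List[int] = []
    for msg in messages:
        ids = tokenizer.encode(msg["content"], add_special_tokens=False)
        keep = 1 if msg["role"] == "assistant" else 0  # mask out tool outputs
        token_ids.extend(ids)
        mask.extend([keep] * len(ids))
    return {
        "responses": torch.tensor(token_ids, dtype=torch.long),
        "response_mask": torch.tensor(mask, dtype=torch.long),
    }


# Hypothetical usage (model name is a placeholder):
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
# out = build_response_and_mask(
#     [{"role": "assistant", "content": "<tool_call>...</tool_call>"},
#      {"role": "tool", "content": "42"},
#      {"role": "assistant", "content": "So the answer is 42."}],
#     tokenizer,
# )
```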
It seems the OpenAI API's retry logic causes the error: Request id xxx already running.
Do we have any examples for reference on this yet? Thanks.
…volcengine#1138)

### Summary
Introduce vLLM AsyncLLM to support multi-turn rollout and volcengine#385 volcengine#398 volcengine#710

### Architecture


**New Components**:
- AsyncLLMWorker: standalone vLLM server instance
  - FastAPI: provides an OpenAI-compatible HTTP server
  - AsyncLLM: async LLMEngine for online serving; for more details see [AsyncLLM](vllm-project/vllm#9826) and [LLMEngine](https://docs.vllm.ai/en/latest/design/arch_overview.html#llmengine)
  - ExternalRayDistributedExecutor: custom executor backend that manages workers in the worker group, grabbing the corresponding workers by actor name
- AsyncLLManager: manages a group of vLLM server instances (AsyncLLMWorker)
  - AsyncLLM lifecycle: initialization, wake_up, sleep
  - FastAPI service discovery
- ChatScheduler: schedules multiple chat completion requests across multiple server instances
  - Least-requests load balancing
  - Sticky session with prefix caching
  - Chat completion callback: tool calling

### TODO
- [x] AsyncLLM: initialization/wake_up/sleep
- [x] OpenAI API: support `/v1/chat/completions`
- [x] RayPPOTrainer integration: replace `generate_sequences` with an HTTP call to `/v1/chat/completions`
- [x] GSM8K e2e training
- [ ] Add documentation

---------

Co-authored-by: shengguangming <[email protected]>
@wuxibin89
code
Could you provide some debugging suggestions?
It seems that the worker_group is not held, which causes the Ray actors to be garbage collected; please post the full test function.
Hello @wuxibin89 Here is the full test code.
Here is the complete log output.
My observation is that the larger the
How does the actor sync weights to the vLLM async server? I'm not sure if I've missed something. Thanks.
@dawson-chen I found the root cause of this problem. It's a bug: RayWorkerGroup holds weak references to these actors. Fixed in #1443
@wwd29 Weight sync between actor and async server works as follows
…1443)

### What does this PR do?
Spawned RayWorkerGroup gets actors by name, which holds only a weak reference to each actor and causes actors to be garbage collected unexpectedly. Pass the actor handles explicitly in spawn so that RayWorkerGroup holds strong references to these actors.

close #1365 #1138 (comment)
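A small self-contained illustration of the pitfall this fix describes, using plain Ray outside of verl; the class and actor names are illustrative.

```python
# Illustration of the weak-reference pitfall described above, using plain Ray.
# A name-based lookup (ray.get_actor) does not own the actor: if the creating
# handle is dropped, the actor can be garbage collected and later lookups fail
# with "Failed to look up actor with name ...". Names here are illustrative.
import ray

ray.init()


@ray.remote
class Worker:
    def ping(self) -> str:
        return "pong"


# The creator's handle keeps this (non-detached) named actor alive.
owning_handle = Worker.options(name="worker_0").remote()

# A spawned worker group that only looks actors up by name holds weak references.
looked_up = ray.get_actor("worker_0")
print(ray.get(looked_up.ping.remote()))  # works while owning_handle is alive

# The fix: pass the actor handles explicitly to the spawned RayWorkerGroup so it
# keeps strong references for its whole lifetime instead of relying on names.
```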
Awesome man! It works. You are a genius. Recently, I've been looking over the async-rollout branches of SGLang and vLLM. verl-SGLang's async is limited to each DP shard, so a long request can block the others. vLLM is fully async: every DP shard runs its own service, the main process fires requests in parallel, and jobs are load-balanced across shards with no waiting. Is that accurate?
@wuxibin89 Thank you very much for your fix. However, on the latest
@U-rara Maybe there's an exception; did you check the exception in the callback function?
@wuxibin89 Thank you for your reply. My error is exactly the same as his: the completions passed to the callback are None, and there are no other error messages. Everything works fine at shorter generation lengths (e.g. 8K), but this error occurs when the length reaches 16K-32K, so I suspect the issue originates from vLLM; however, because it's in an asynchronous environment, the exception isn't being caught.
@wuxibin89 @U-rara I had this error too. It occurs because there is no check in
My vLLM did throw an error. I don't have the full log, but it's something like the max length has been exceeded, and then the actor dies as you see above in the error log.
@U-rara After #1443, this exception should not happen.
If any exception happens in the OpenAI client call, the exception should be passed to the callback. Please check it in your callback:

async def callback(completions: ChatCompletion, info: Dict[str, Any], exception: Exception):
    assert exception is None, f"generate_sequences failed: {exception}"
    ...
@wuxibin89 @casper-hansen First, with a 32K inference length, I found that when
Upon investigation, I realized this error comes from the default two retries in verl/verl/workers/rollout/async_server.py (lines 196 to 199 at f147ede).
When I set
I suspect that because we didn't explicitly set a timeout for AsyncOpenAI, the default 600-second timeout kicks in on long generations, so I changed the client construction to:

client = AsyncOpenAI(
    base_url=f"http://{address}/v1",
    api_key="token-abc123",
    timeout=None,
    max_retries=0,
)

After testing, everything worked as expected. May I open a PR for this change? Since this seems to have been discussed before, I'm not sure whether it's harmless. Please let me know if you have any other suggestions.
@U-rara Feel free to open a PR.
…lout (#1483)

### What does this PR do?
In async rollout, `AsyncOpenAI` has a default 600-second timeout, which can lead to timeouts during longer inference. See details at #1138 (comment).
Does async rollout give the same performance as the sync one? I.e., if I keep everything else the same and use async rollout with a one-turn chat scheduler, should I get the same performance?
def init_worker(self, all_kwargs: List[Dict[str, Any]]):
    """Initialize worker engine."""
    all_kwargs[0]["rank"] = int(os.environ["RANK"])
    all_kwargs[0]["local_rank"] = 0
why is local_rank zero? 🤔
Summary
Introduce vLLM AsyncLLM to support multi-turn rollout and #385 #398 #710

Architecture

New Components:
- AsyncLLMWorker: standalone vLLM server instance
- AsyncLLManager: manages a group of vLLM server instances (AsyncLLMWorker)
- ChatScheduler: schedules multiple chat completion requests across multiple server instances

TODO
- OpenAI API: support /v1/chat/completions
- RayPPOTrainer integration: replace generate_sequences with an HTTP call to /v1/chat/completions
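As a usage illustration, here is a minimal client call against one of the AsyncLLMWorker servers, reusing the client settings shown earlier in this thread; the server address and model name are placeholders.

```python
# Minimal example of calling an AsyncLLMWorker's OpenAI-compatible endpoint,
# reusing the client settings shown earlier in this thread. The server address
# and model name are placeholders for whatever your deployment exposes.
import asyncio

from openai import AsyncOpenAI


async def main() -> None:
    address = "127.0.0.1:8000"  # placeholder: address of one vLLM server instance
    client = AsyncOpenAI(
        base_url=f"http://{address}/v1",
        api_key="token-abc123",
        timeout=None,    # avoid the default 600s timeout on long generations
        max_retries=0,   # retries can trigger "Request id xxx already running"
    )
    completion = await client.chat.completions.create(
        model="Qwen/Qwen2-7B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": "What is 2 + 2?"}],
    )
    print(completion.choices[0].message.content)


asyncio.run(main())
```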