[PD] Handle P/D failure and reconnect without affecting other instances #6263

Merged
merged 63 commits into from
May 27, 2025

ShangmingCai
Collaborator

@ShangmingCai ShangmingCai commented May 13, 2025

Motivation

This PR enables Mooncake to handle both prefill and decode failures and to reconnect without affecting other healthy instances. When a prefill or decode instance dies or is killed, you can restart it and it will pair again with the instances that are still running.

This PR requires mooncake-transfer-engine >= 0.3.2.post1

CC: @whybeyoung @ByronHsu

Modifications

  • Add a health route to the bootstrap server, and add a heartbeat-and-cleanup thread on the decode side to detect prefill failures and abort the affected requests.
  • Identify failed Mooncake sessions when a transfer sync times out, which usually means the decode instance has been killed, and abort the requests from the same session.
  • Add a clear function to the KV sender and receiver so that request status is always cleaned up, which prevents later failures caused by an identical bootstrap room. This should help users who run their own load balancer, which may have a higher chance of generating the same bootstrap room for different requests after a period of time.
  • Record why a room failed on the prefill side and sync the failure to the decode node.
  • Add a failure exception implementation that clears status and reports the root cause.
  • Adjust the logging level of all logs related to PD with Mooncake.
  • Remove the indices assertion so that a potential page size mismatch no longer kills the transfer thread. The mismatch only happens occasionally with PD + chunked prefill + page size > 1, and is a bug that will be fixed.
  • Reduce the thread pool size for kv transfer to prevent consuming too many CPU resources.
  • Add multiple env vars to enable user-specific configurations.
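
The heartbeat and cleanup behaviour listed above can be illustrated with a minimal sketch. This is not the code from this PR: `HeartbeatMonitor`, `check_health`, `max_failures`, and the address strings are hypothetical names chosen for illustration, and the real implementation would query the health route on the bootstrap server rather than call an injected function. The sketch assumes a decode-side monitor that counts consecutive failed health checks per prefill address, marks only the rooms bound to a dead address as failed (with a recorded reason), and offers a `clear` method so that a reused bootstrap room starts from a clean state.

```python
import threading
from collections import defaultdict
from typing import Callable, Dict, Set


class HeartbeatMonitor:
    """Hypothetical decode-side monitor; not the actual SGLang class."""

    def __init__(self, check_health: Callable[[str], bool], max_failures: int = 2):
        # check_health stands in for a request to the prefill health route.
        self.check_health = check_health
        self.max_failures = max_failures
        self.failures: Dict[str, int] = defaultdict(int)
        self.rooms_by_addr: Dict[str, Set[int]] = defaultdict(set)
        self.failed_rooms: Dict[int, str] = {}  # room -> recorded root cause
        self.lock = threading.Lock()

    def register(self, addr: str, room: int) -> None:
        """Bind a bootstrap room to the prefill address that serves it."""
        with self.lock:
            self.rooms_by_addr[addr].add(room)

    def clear(self, room: int) -> None:
        """Drop all state for a room so a later identical room id is safe."""
        with self.lock:
            for rooms in self.rooms_by_addr.values():
                rooms.discard(room)
            self.failed_rooms.pop(room, None)

    def poll_once(self) -> None:
        """Run one heartbeat round over every known prefill address."""
        with self.lock:
            addrs = list(self.rooms_by_addr)
        for addr in addrs:
            try:
                healthy = self.check_health(addr)  # done outside the lock
            except Exception:
                healthy = False  # an unreachable peer counts as a failure
            with self.lock:
                if healthy:
                    self.failures[addr] = 0
                    continue
                self.failures[addr] += 1
                if self.failures[addr] >= self.max_failures:
                    # Fail only the rooms tied to this address.
                    for room in self.rooms_by_addr.pop(addr, set()):
                        self.failed_rooms[room] = f"prefill instance {addr} is unreachable"
                    self.failures.pop(addr, None)

    def run(self, interval: float, stop: threading.Event) -> None:
        """Thread target: poll until the stop event is set."""
        while not stop.wait(interval):
            self.poll_once()
```

In this sketch a restarted prefill instance is picked up again simply by calling `register` with its address, and rooms served by healthy addresses are never touched, which mirrors the goal of not affecting other instances.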

Collaborator

whybeyoung commented May 13, 2025

LGTM, thank you. Will test it.
It's an implementation of feature request #6215.

Signed-off-by: Shangming Cai <[email protected]>
@ShangmingCai changed the title from "[PD] Handle prefill failure and reconnect without affecting decode instances" to "[PD] Handle P/D failure and reconnect without affecting other instances" on May 25, 2025
@zhyncs merged commit 3ce94f7 into sgl-project:main on May 27, 2025
3 of 36 checks passed
5 participants