
Hope lws supports a pause feature similar to deployment #506

Open
@loda13

Description

What would you like to be added:

  1. LWS should support pausing an in-progress update, similar to pausing a Deployment rollout. Once a pause is triggered, the leader-worker group that is currently being updated should finish its update before the rollout pauses.
  2. A paused update should be resumable.
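
To make the requested semantics concrete, here is a minimal sketch in Python. It is a toy model, not the LWS controller or its API: the `Rollout` class, its fields, and the `during_update` hook are all hypothetical names introduced for illustration. The point it shows is that a pause arriving while a group is being updated lets that group finish, and only blocks the next group.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """Toy model of a rolling update over leader-worker groups (hypothetical, not the LWS API)."""
    groups: list[str]   # version currently running in each group
    target: str         # version being rolled out
    paused: bool = False
    log: list[str] = field(default_factory=list)

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def step(self, during_update=None) -> bool:
        """Update the next outdated group unless paused; return True if a group was updated."""
        if self.paused:
            return False
        for i, version in enumerate(self.groups):
            if version != self.target:
                if during_update is not None:
                    # Simulates an event (such as a pause request) arriving mid-update.
                    during_update(self)
                # The group that already started always completes its update.
                self.groups[i] = self.target
                self.log.append(f"group-{i} -> {self.target}")
                return True
        return False

    @property
    def done(self) -> bool:
        return all(v == self.target for v in self.groups)


rollout = Rollout(groups=["v1", "v1", "v1"], target="v2")
# A pause is requested while group-0 is updating: group-0 still finishes.
first = rollout.step(during_update=lambda r: r.pause())
blocked = rollout.step()          # no further group is touched while paused
rollout.resume()
while rollout.step():             # resume continues from where the rollout stopped
    pass
```

In this model the pause flag is only checked between groups, which is what guarantees that no group is left half-updated.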

Why is this needed:

In our current prefill/decode (PD) disaggregated deployment, prefill (P) and decode (D) are deployed as two separate LWS instances, and updating the PD service requires both LWS instances to be updated in coordination.

Throughout an update, both old and new P nodes must always be able to find D nodes of the matching version. Otherwise a request can finish prefill and then fail because no compatible D node is available to decode it.

  1. A first batch of D nodes is updated, followed by the P nodes, so that every updated P node can always reach an updated D node.

  2. After all P nodes are updated, the remaining D nodes are updated, so that any old P nodes still running during the transition can keep working with old D nodes.

Completion requirements:

Shell                            D (Decode)                       P (Prefill)
  |                                |                                |
  | -- Trigger D update ---------> |                                |
  | <-------- D reports ready ---- |                                |
  |                                | -- Trigger P update ---------> |
  |                                | <-------- P reports ready ---- |
  |                                | -- Continue P batch update --> |
  |                                | <----- P batch update done --- |
  | -- Trigger D batch update ---> |                                |
  | <----- D batch update done --- |                                |
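
The ordering above can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not part of LWS: `pd_update_plan` and `versions_always_paired` are hypothetical helpers, groups are modeled as switching versions instantly, and temporary unavailability during an update is ignored. It builds the "first D batch, then all P, then remaining D" plan and verifies that after every step each running P version has a D group on the same version.

```python
def pd_update_plan(d_count: int, p_count: int, d_first_batch: int) -> list[tuple[str, int]]:
    """Return (role, index) steps: first D batch, then all P groups, then the remaining D groups."""
    if not 0 < d_first_batch < d_count:
        # Old P groups need at least one old D group until all P groups are updated,
        # and new P groups need at least one new D group from the start.
        raise ValueError("first D batch must leave at least one old and one new D group")
    plan = [("D", i) for i in range(d_first_batch)]
    plan += [("P", i) for i in range(p_count)]
    plan += [("D", i) for i in range(d_first_batch, d_count)]
    return plan


def versions_always_paired(d_count: int, p_count: int, plan, old: str = "v1", new: str = "v2") -> bool:
    """Check that after every step each running P version also runs on some D group."""
    d = [old] * d_count
    p = [old] * p_count
    for role, i in plan:
        (d if role == "D" else p)[i] = new
        if not set(p) <= set(d):
            return False
    return True


plan = pd_update_plan(d_count=4, p_count=3, d_first_batch=2)
ordered_ok = versions_always_paired(4, 3, plan)

# Counter-example: updating P before any D leaves a new P group without a new D group.
p_first = [("P", i) for i in range(3)] + [("D", i) for i in range(4)]
p_first_ok = versions_always_paired(4, 3, p_first)
```

The pause and resume points requested in this issue correspond to the boundaries between the three phases of the plan: the D rollout would be paused after its first batch and resumed only once the P rollout has finished.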

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Labels: kind/feature (categorizes issue or PR as related to a new feature)
