
Kubernetes Operator is blindly killing workers #807

Closed

Description

@BitTheByte

Currently, dask's operator 2023.8.1 uses Kubernetes replicas to scale the workers up and down, as seen in:

```python
wg = await DaskWorkerGroup(
    f"{self.name}-{worker_group}", namespace=self.namespace
)
await wg.scale(n)
```
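For context, a DaskWorkerGroup is a custom resource, so `wg.scale(n)` ultimately comes down to patching its replica count. Here is a minimal sketch of what that amounts to, using the official `kubernetes` Python client (the namespace and resource name are hypothetical, and the `spec.worker.replicas` path is my reading of the CRD, not confirmed from the operator source):

```python
import kubernetes

kubernetes.config.load_kube_config()
api = kubernetes.client.CustomObjectsApi()

# Patching replicas tells Kubernetes *how many* worker pods to run, but
# not *which* pods are safe to remove: on scale-down, pods get deleted
# without consulting the Dask scheduler about the data they hold.
api.patch_namespaced_custom_object(
    group="kubernetes.dask.org",
    version="v1",
    namespace="default",        # hypothetical namespace
    plural="daskworkergroups",
    name="my-cluster-default",  # hypothetical worker group name
    body={"spec": {"worker": {"replicas": 2}}},
)
```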

This results in cases such as #659. Since Kubernetes doesn't know what state or data each worker holds, it kills workers arbitrarily when scaling as requested by the operator, causing instability or partial data loss if a data-moving operation is interrupted.

Scaling up doesn't cause much trouble, as it only adds new workers; the problems occur during scale-down.
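A safer scale-down would let the Dask scheduler pick and drain the victims before their pods are deleted. A minimal sketch of that idea, assuming the scheduler is reachable and using the public `dask.distributed` Client API (this illustrates graceful retirement in general, not the operator's actual code; `retire_then_delete` and `scheduler_address` are hypothetical):

```python
from dask.distributed import Client


async def retire_then_delete(scheduler_address: str, n_workers: int) -> list[str]:
    """Gracefully retire n_workers before their pods are deleted."""
    async with Client(scheduler_address, asynchronous=True) as client:
        # The scheduler chooses the workers whose loss is cheapest and
        # replicates their in-memory data to surviving workers before
        # closing them; `n` is forwarded to Scheduler.workers_to_close.
        retired = await client.retire_workers(n=n_workers, close_workers=True)
    # `retired` maps worker address -> worker info; the operator could
    # then delete exactly these pods instead of letting Kubernetes pick.
    return list(retired)
```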
