Kubernetes Operator is blindly killing workers #807
I feel like I should add more context, so here it is: after some digging around I found that I was wrong and Dask is actually trying to shut down workers as it should, using retire_workers. However, it seems this function has multiple layers of fallback. My cluster was suffering from unexpectedly killed workers, and to narrow down which fallback case was being hit I had to enable the HTTP API.
Thanks for sharing your findings here. I'm glad enabling the HTTP API resolved this for you. Our goal is to enable the API by default in future versions of distributed, but there are ongoing discussions on how we should authenticate it. The last fallback strategy is a worst-case scenario, but I wonder if we can do more to highlight to the user that this is happening.
Thanks @jacobtomlinson for taking the time to look into closed issues. A warning suggesting a problem with the dask RPC would have helped a lot in this case. As for authentication, I implemented my own authentication gateway, so that's not a concern.
Where would it be useful to surface this warning? The problem happens within the controller, so would a warning log line be enough? Or do you mean passing a warning back to the
Yeah, I would expect this to be the case for many users. However, some folks expose their dashboard to the internet, so we treat it as a read-only resource. Enabling the API turns it into a read/write resource, and we should probably implement some kind of default authentication.
I believe a warning within the controller would be enough.
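Something along these lines is what I have in mind: a rough sketch (my own illustration, not the controller's actual code) of the graceful retire-workers path and the point where a warning could be emitted before falling back to LIFO pod deletion. The function name and log wording are hypothetical:

```python
import logging

from distributed.core import rpc

logger = logging.getLogger(__name__)


async def retire_n_workers(scheduler_address: str, n: int) -> bool:
    """Ask the scheduler to retire n workers; return False if that wasn't possible."""
    try:
        async with rpc(scheduler_address) as scheduler:
            # Let the scheduler pick the cheapest workers to remove, then retire
            # them so their tasks and data migrate before any pods are deleted.
            to_close = await scheduler.workers_to_close(n=n)
            await scheduler.retire_workers(workers=to_close)
        return True
    except OSError:
        # This is where the warning discussed above would help: make it obvious
        # that the controller is about to delete pods without retiring them.
        logger.warning(
            "Could not reach the Dask scheduler at %s; falling back to LIFO "
            "deletion of %d worker pods. Active tasks and in-memory data on "
            "those workers may be lost.",
            scheduler_address,
            n,
        )
        return False
```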
Sorry for writing on a closed issue, but what is the current way to remedy this? It makes auto-scaling very problematic: we end up losing a lot of active workers, which slows down the whole system, and the worker count goes under our "min" workers quite often. P.S. Is activating the HTTP API the only way to go? I couldn't find much information about how to turn it on in the distributed documentation. Is it just a matter of setting a config option?
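For reference, here is a minimal sketch of what I believe enables the API, based on the distributed config schema; the route module name and the idea of setting it before the scheduler starts are my assumptions rather than official guidance:

```python
import dask
import distributed  # noqa: F401  (loads the distributed config defaults)

# The scheduler HTTP API handlers live in the distributed.http.scheduler.api
# module, which is not in the default route list. Appending it to
# distributed.scheduler.http.routes before the scheduler starts should expose
# the /api/v1/... endpoints.
routes = dask.config.get("distributed.scheduler.http.routes")
if "distributed.http.scheduler.api" not in routes:
    routes.append("distributed.http.scheduler.api")

dask.config.set({"distributed.scheduler.http.routes": routes})
```

In a dask-kubernetes deployment this configuration has to reach the scheduler pod, for example through a Dask config environment variable such as `DASK_DISTRIBUTED__SCHEDULER__HTTP__ROUTES` or a mounted config file (again, my assumption about the most convenient route).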
@tasansal it would be interesting if you could check your logs and see why it is falling back to LIFO scaling. It should fall back to the RPC if you don't have the HTTP API enabled. |
Currently, dask's operator (2023.8.1) is using Kubernetes replicas to scale the workers up and down, as seen in dask-kubernetes/dask_kubernetes/operator/kubecluster/kubecluster.py, lines 745 to 748 at commit 7c09b57.
This results in cases such as #659. Since Kubernetes doesn't know the state of the workers or the data stored on them, it will kill workers indiscriminately when scaling up or down as requested by the operator, resulting in instability or partial data loss if a data-moving operation is interrupted.
Scaling up doesn't cause much trouble, since it only adds new workers; the problems occur when scaling down. A sketch of what this looks like is included below.
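To make the failure mode concrete, here is a rough sketch (my own illustration, not the actual kubecluster.py code) of what "scaling via replicas" amounts to: the cluster manager patches the desired worker count on the worker-group resource, and the controller reconciles the pod count to match, without asking the scheduler which workers are safe to remove. The worker-group name and the exact spec layout are my assumptions:

```python
from kubernetes_asyncio import client, config


async def scale_worker_group(namespace: str, cluster_name: str, n: int) -> None:
    # Patch the desired replica count on the DaskWorkerGroup custom resource;
    # the operator's controller then adds or deletes worker pods to match.
    # Nothing here consults the scheduler about which workers hold active
    # tasks or data.
    await config.load_kube_config()
    async with client.ApiClient() as api:
        custom_objects = client.CustomObjectsApi(api)
        await custom_objects.patch_namespaced_custom_object(
            group="kubernetes.dask.org",
            version="v1",
            namespace=namespace,
            plural="daskworkergroups",
            name=f"{cluster_name}-default",  # hypothetical worker-group name
            body={"spec": {"worker": {"replicas": n}}},
        )
```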