Kubernetes Operator is blindly killing workers #807
I feel like I should add more context, so here it is: after some digging around I found that I was wrong and Dask is actually trying to shut down workers as it should, using retire_workers. However, it seems this function has multiple layers of fallback. My cluster was suffering from unexpectedly killed workers, and to narrow down which fallback case was being hit I had to enable the HTTP API.
Thanks for sharing your findings here. I'm glad enabling the HTTP API resolved this for you. Our goal is to enable the API by default in future versions of distributed, but there are ongoing discussions on how we should authenticate it. The last fallback strategy is a worst-case scenario, but I wonder if we can do more to highlight to the user that this is happening.
Thanks @jacobtomlinson for taking the time to look into closed issues. A warning suggesting a problem with the dask RPC would have helped a lot in this case. As for authentication, I implemented my own authentication gateway, so that's not a concern.
Where would it be useful to surface this warning? The problem happens within the controller, so would a warning log line be enough? Or do you mean passing a warning back to the
Yeah, I would expect this to be the case for many users. However, some folks expose their dashboard to the internet, so we treat it as a read-only resource. Enabling the API turns it into a read/write resource, and we should probably implement some kind of default authentication.
I believe a warning within the controller would be enough.
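Something along these lines is what I have in mind: a rough sketch (my own illustration, not the controller's actual code) of the graceful retire-workers path and the point where a warning could be emitted before falling back to LIFO pod deletion. The function name and log wording are hypothetical:

```python
import logging

from distributed.core import rpc

logger = logging.getLogger(__name__)


async def retire_n_workers(scheduler_address: str, n: int) -> bool:
    """Ask the scheduler to retire n workers; return False if that wasn't possible."""
    try:
        async with rpc(scheduler_address) as scheduler:
            # Let the scheduler pick the cheapest workers to remove, then retire
            # them so their tasks and data migrate before any pods are deleted.
            to_close = await scheduler.workers_to_close(n=n)
            await scheduler.retire_workers(workers=to_close)
        return True
    except OSError:
        # This is where the warning discussed above would help: make it obvious
        # that the controller is about to delete pods without retiring them.
        logger.warning(
            "Could not reach the Dask scheduler at %s; falling back to LIFO "
            "deletion of %d worker pods. Active tasks and in-memory data on "
            "those workers may be lost.",
            scheduler_address,
            n,
        )
        return False
```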
Sorry for writing on a closed issue, but what is the current way to remedy this? It makes auto-scaling very problematic: we end up losing a lot of active workers, which slows down the whole system, and the worker count goes under our "min" workers quite often. P.S. Is activating the HTTP API the only way to go? I couldn't find much information about how to turn it on in the distributed documentation. Is it just a matter of setting a config option?
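For reference, here is a minimal sketch of what I believe enables the API, based on the distributed config schema; the route module name and the idea of setting it before the scheduler starts are my assumptions rather than official guidance:

```python
import dask
import distributed  # noqa: F401  (loads the distributed config defaults)

# The scheduler HTTP API handlers live in the distributed.http.scheduler.api
# module, which is not in the default route list. Appending it to
# distributed.scheduler.http.routes before the scheduler starts should expose
# the /api/v1/... endpoints.
routes = dask.config.get("distributed.scheduler.http.routes")
if "distributed.http.scheduler.api" not in routes:
    routes.append("distributed.http.scheduler.api")

dask.config.set({"distributed.scheduler.http.routes": routes})
```

In a dask-kubernetes deployment this configuration has to reach the scheduler pod, for example through a Dask config environment variable such as `DASK_DISTRIBUTED__SCHEDULER__HTTP__ROUTES` or a mounted config file (again, my assumption about the most convenient route).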
@tasansal it would be interesting if you could check your logs and see why it is falling back to LIFO scaling. It should fall back to the RPC if you don't have the HTTP API enabled. |
Currently, dask's operator (2023.8.1) is using Kubernetes replicas to scale the workers up and down, as seen in dask-kubernetes/dask_kubernetes/operator/kubecluster/kubecluster.py, lines 745 to 748 at commit 7c09b57.
This results in cases such as #659. Since Kubernetes doesn't know the state of the workers or the data stored on them, it will kill workers indiscriminately when scaling up or down as requested by the operator, resulting in instability or partial data loss if a data-moving operation is interrupted.
Scaling up doesn't cause much trouble, since it only adds new workers; the problems occur when scaling down. A sketch of what this looks like is included below.
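To make the failure mode concrete, here is a rough sketch (my own illustration, not the actual kubecluster.py code) of what "scaling via replicas" amounts to: the cluster manager patches the desired worker count on the worker-group resource, and the controller reconciles the pod count to match, without asking the scheduler which workers are safe to remove. The worker-group name and the exact spec layout are my assumptions:

```python
from kubernetes_asyncio import client, config


async def scale_worker_group(namespace: str, cluster_name: str, n: int) -> None:
    # Patch the desired replica count on the DaskWorkerGroup custom resource;
    # the operator's controller then adds or deletes worker pods to match.
    # Nothing here consults the scheduler about which workers hold active
    # tasks or data.
    await config.load_kube_config()
    async with client.ApiClient() as api:
        custom_objects = client.CustomObjectsApi(api)
        await custom_objects.patch_namespaced_custom_object(
            group="kubernetes.dask.org",
            version="v1",
            namespace=namespace,
            plural="daskworkergroups",
            name=f"{cluster_name}-default",  # hypothetical worker-group name
            body={"spec": {"worker": {"replicas": n}}},
        )
```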