I have a microk8s setup with three master nodes and one worker node.
lukas@ckf-ha-1:~$ sudo microk8s kubectl get no
NAME STATUS ROLES AGE VERSION
ckf-ha-1 Ready <none> 47h v1.29.15
ckf-ha-2 Ready <none> 47h v1.29.15
ckf-ha-3 Ready <none> 47h v1.29.15
ckf-ha-worker Ready <none> 47h v1.29.15
I have also set up microceph on the same nodes, with one virtual disk each as an OSD.
On this cluster I installed CKF 1.9 and created a large number of pipeline runs in the default user's profile.
After a while, some limit within dqlite seems to be reached and the namespace becomes unusable:
lukas@ckf-ha-1:~$ sudo microk8s kubectl get pods -n lukas
Error from server: rpc error: code = Unknown desc = (
SELECT MAX(rkv.id) AS id
FROM kine AS rkv)
I assume there are too many failed or completed pods in the namespace.
Other namespaces are still fine.
Deleting this namespace also behaves strangely: it reports successful deletion, but running get pods against the deleted namespace still returns the same error.
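A namespace that reports deletion but keeps returning errors is often stuck in the Terminating phase, waiting on finalizers. As a quick check (a sketch; `lukas` is the affected namespace from the output above, adjust as needed):

```shell
# Show the namespace phase and any finalizers that may be blocking deletion.
microk8s kubectl get namespace lukas -o jsonpath='{.status.phase}{"\n"}{.spec.finalizers}{"\n"}'
```

If the phase reported here is Terminating, the apiserver has accepted the deletion but cannot complete it, which would match the behavior described above.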
Recreating the cluster and running a manual garbage-collection job, scheduled every 3 hours, to delete all failed and completed pods seems to mitigate the issue.
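A sketch of what such a garbage-collection job could look like, assuming the affected profile namespace is `lukas` (the schedule itself would come from cron or a Kubernetes CronJob):

```shell
# Delete pods that have finished or failed, to keep the number of
# objects stored in dqlite/kine bounded. "lukas" is the namespace
# from the report above; adjust as needed.
microk8s kubectl delete pods -n lukas --field-selector=status.phase==Succeeded
microk8s kubectl delete pods -n lukas --field-selector=status.phase==Failed
```

As an alternative to a custom job, kube-controller-manager's built-in terminated-pod garbage collector (the `--terminated-pod-gc-threshold` flag) caps how many terminated pods are kept cluster-wide, which may achieve the same effect without external scheduling.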
What Should Happen Instead?
Running thousands to hundreds of thousands of KFP runs should not corrupt the namespace or dqlite.
I hope you're doing well. Thank you for filing your issue.
Could you please send a microk8s inspection report to help us debug the issue? I'm also curious about the workload you are running on your cluster; would you be happy to share it?
Reproduction Steps
microk8s kubectl edit pvc kfp-db-database-XXX-kfp-db-0 -n kubeflow
Introspection Report
I need to recreate the issue to provide the inspection report.
I can also check whether I can reproduce it on a single-node setup.
Can you suggest a fix?
Are you interested in contributing with a fix?
Possibly