Broken namespace after too many pods #4999

Open
Naegionn opened this issue Apr 1, 2025 · 2 comments

Naegionn commented Apr 1, 2025

Summary

I have a microk8s setup with three master nodes and one worker node.

lukas@ckf-ha-1:~$ sudo microk8s kubectl get no
NAME            STATUS   ROLES    AGE   VERSION
ckf-ha-1        Ready    <none>   47h   v1.29.15
ckf-ha-2        Ready    <none>   47h   v1.29.15
ckf-ha-3        Ready    <none>   47h   v1.29.15
ckf-ha-worker   Ready    <none>   47h   v1.29.15

I have also set up microceph on the same nodes, with one virtual disk per node as an OSD.
On this cluster I installed CKF 1.9 and created a large number of pipeline runs in the default user's profile.

After a while, I suspect some limit within dqlite is reached and the namespace becomes unusable:

lukas@ckf-ha-1:~$ sudo microk8s kubectl get pods -n lukas
Error from server: rpc error: code = Unknown desc = (
                SELECT MAX(rkv.id) AS id
                FROM kine AS rkv)

I assume there are too many failed or completed pods within the namespace; other namespaces are still fine.
Deleting the namespace also behaves strangely: it reports successful deletion, but running get pods against the deleted namespace still returns the same error.
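
One way to check whether the namespace is actually gone or merely stuck is to query its status directly (the namespace name lukas is taken from the output above):

# prints the namespace phase, e.g. Active or Terminating, if the namespace still exists
microk8s kubectl get namespace lukas -o jsonpath='{.status.phase}{"\n"}'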

Recreating the cluster and running a manual garbage collection job, scheduled every 3 hours, to delete all failed and completed pods seems to mitigate the issue.
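
A minimal sketch of the cleanup such a job can run, assuming the affected namespace is lukas as above (a CronJob or any external scheduler can invoke the same commands every 3 hours):

# remove pods that finished successfully
microk8s kubectl delete pods -n lukas --field-selector=status.phase==Succeeded
# remove pods that failed
microk8s kubectl delete pods -n lukas --field-selector=status.phase==Failed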

What Should Happen Instead?

Running thousands to hundreds of thousands of KFP runs should not corrupt the namespace or dqlite.

Reproduction Steps

  1. Set up microceph, microk8s and CKF.
  2. Increase the kfp-db-0 PV size: microk8s kubectl edit pvc kfp-db-database-XXX-kfp-db-0 -n kubeflow (a non-interactive sketch follows this list).
  3. Create a large number of KFP runs.
  4. The namespace eventually becomes unusable as described above.
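
For step 2, a non-interactive sketch of the resize; the XXX placeholder comes from the step above and the 50Gi target size is only an assumed example, both need to match the actual deployment (the storage class must allow volume expansion):

# grow the kfp-db PVC to the assumed example size of 50Gi
microk8s kubectl patch pvc kfp-db-database-XXX-kfp-db-0 -n kubeflow -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'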

Introspection Report

I need to recreate this to provide the introspection report.
I can also check whether I can reproduce it on a single-node setup.

Can you suggest a fix?

Are you interested in contributing with a fix?

Possibly

louiseschmidtgen (Contributor) commented Apr 2, 2025

Hi @Naegionn!

I hope you're doing well. Thank you for filing your issue.

Could you please send a microk8s inspection report to help us debug the issue? I'm curious about the workload you are running on your cluster; would you be happy to share it?

Best regards,
Louise

Naegionn (Author) commented Apr 7, 2025

@louiseschmidtgen
Hi, here is the inspection report:
inspection-report-20250325_234441.tar.gz

Currently I am just testing Kubeflow Pipelines stability to make sure it can handle thousands of jobs per day.
