Broken namespace after too many pods #4999

Open
Naegionn opened this issue Apr 1, 2025 · 2 comments

Naegionn commented Apr 1, 2025

Summary

I have a microk8s setup with three master nodes and one worker node.

lukas@ckf-ha-1:~$ sudo microk8s kubectl get no
NAME            STATUS   ROLES    AGE   VERSION
ckf-ha-1        Ready    <none>   47h   v1.29.15
ckf-ha-2        Ready    <none>   47h   v1.29.15
ckf-ha-3        Ready    <none>   47h   v1.29.15
ckf-ha-worker   Ready    <none>   47h   v1.29.15

I have also set up microceph on the same nodes, with one virtual disk per node as an OSD.
On this cluster I installed CKF 1.9 and created a large number of pipeline runs in the default user's profile.

After a while, I suspect some limit within dqlite is reached and the namespace becomes unusable:

lukas@ckf-ha-1:~$ sudo microk8s kubectl get pods -n lukas
Error from server: rpc error: code = Unknown desc = (
                SELECT MAX(rkv.id) AS id
                FROM kine AS rkv)

I assume there are too many failed or completed pods within the namespace; other namespaces are still fine.
Deleting the namespace also behaves strangely: it reports successful deletion, but running get pods against the deleted namespace still returns the same error.
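
One way to check whether the namespace is actually gone or merely stuck is to query its status directly (the namespace name lukas is taken from the output above):

# prints the namespace phase, e.g. Active or Terminating, if the namespace still exists
microk8s kubectl get namespace lukas -o jsonpath='{.status.phase}{"\n"}'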

Recreating the cluster and running a manual garbage collection job, scheduled every 3 hours, to delete all failed and completed pods seems to mitigate the issue.
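
A minimal sketch of the cleanup such a job can run, assuming the affected namespace is lukas as above (a CronJob or any external scheduler can invoke the same commands every 3 hours):

# remove pods that finished successfully
microk8s kubectl delete pods -n lukas --field-selector=status.phase==Succeeded
# remove pods that failed
microk8s kubectl delete pods -n lukas --field-selector=status.phase==Failed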

What Should Happen Instead?

Running thousands to hundreds of thousands of KFP runs should not corrupt the namespace or dqlite.

Reproduction Steps

  1. Set up microceph, microk8s and CKF.
  2. Increase the kfp-db-0 PV size: microk8s kubectl edit pvc kfp-db-database-XXX-kfp-db-0 -n kubeflow (a non-interactive sketch follows this list).
  3. Create a large number of KFP runs.
  4. The namespace eventually becomes unusable as described above.
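
For step 2, a non-interactive sketch of the resize; the XXX placeholder comes from the step above and the 50Gi target size is only an assumed example, both need to match the actual deployment (the storage class must allow volume expansion):

# grow the kfp-db PVC to the assumed example size of 50Gi
microk8s kubectl patch pvc kfp-db-database-XXX-kfp-db-0 -n kubeflow -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'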

Introspection Report

I need to recreate this to provide the introspection report.
I can also check whether I can reproduce it on a single-node setup.

Can you suggest a fix?

Are you interested in contributing with a fix?

Possibly

louiseschmidtgen (Contributor) commented Apr 2, 2025

Hi @Naegionn!

I hope you're doing well. Thank you for filing your issue.

Could you please send a microk8s inspection report to help us debug the issue? I'm curious about the workload you are running on your cluster; would you be happy to share it?

Best regards,
Louise

Naegionn (Author) commented Apr 7, 2025

@louiseschmidtgen
Hi, here is the inspection report:
inspection-report-20250325_234441.tar.gz

Currently I am just testing Kubeflow Pipelines stability to make sure it can handle thousands of jobs per day.
